HUMAN GENOME-DERIVED SINGLE EXON NUCLEIC ACID PROBES USEFUL
FOR ANALYSIS OF GENE EXPRESSION IN HUMAN HEART

CROSS REFERENCE TO RELATED APPLICATIONS 

The present application is a continuation-in-part of U.S.
patent application serial nos. 09/632,366, filed August 3,
2000 and 09/508,408, filed June 30, 2000; claims the
benefit under 35 U.S.C. § 119(e) of U.S. provisional patent
application serial nos. 60/236,359, filed September 27,
[...] filed May 26, 2000, and 60/180,312, filed February 4, 2000;
and further claims the benefit under 35 U.S.C. § 119(a) of
UK patent application no. 0024263.6, filed October 4, 2000,
the disclosures of which are incorporated herein by
reference in their entireties.

REFERENCE TO SEQUENCE LISTING AND INCORPORATION BY 
REFERENCE THEREOF 

The present application includes a Sequence Listing in
electronic format, filed pursuant to PCT Administrative
Instructions 801 - 806 on a single CD-R disc, in
triplicate, containing a file named pto_HEART.txt, created
24 January 2001, having 20,186,946 bytes. The Sequence
Listing contained in said file on said disc is incorporated
herein by reference in its entirety.

Field of the Invention 

The present invention relates to genome-derived
single exon microarrays useful for verifying the expression
of regions of genomic DNA predicted to encode protein. In
particular, the present invention relates to unique
genome-derived single exon nucleic acid probes expressed in
human heart, and single exon nucleic acid microarrays that
include [...]


Background of the Invention 
For almost two decades following the invention of
general techniques for nucleic acid sequencing, Sanger et
al., Proc. Natl. Acad. Sci. USA 70(4):1209-13 (1973);
Gilbert et al., Proc. Natl. Acad. Sci. USA 70(12):3581-4
(1973), these techniques were used principally as tools to
further the understanding of proteins, known or
suspected, about which a basic foundation of biological
knowledge had already been built. In many cases, the
cloning effort that preceded sequence identification had
been both informed and directed by that antecedent
biological understanding.

For example, the cloning of the T cell receptor
for antigen was predicated upon its known or suspected cell
type-specific expression, its suspected membrane
association, and the predicted assembly of its gene via
T cell-specific somatic recombination. Subsequent
sequencing efforts at once confirmed and extended
understanding of this family of proteins. Hedrick et al.,
Nature 308(5955):153-8 (1984).

More recently, however, the development of high
throughput sequencing methods and devices, in concert with
large public and private undertakings to sequence the human
and other genomes, has altered this investigational
paradigm: today, sequence information often precedes
[...] product.

One of the approaches to large-scale sequencing
is predicated upon the proposition that expressed
sequences, that is, those accessible through isolation of
mRNA, are of greatest initial interest. This "expressed
sequence tag" ("EST") approach has already yielded vast
amounts of sequence data (see for example Adams et al.,
Science 252:1651 (1991); Williamson, Drug Discov. Today
4:115 (1999)). For nucleic acids sequenced by this
approach, often the only biological information that is
known a priori with any certainty is the likelihood of
biologic expression itself. By virtue of the species and
tissue from which the mRNA had originally been obtained,
most such sequences are also annotated with the identity of
the species and at least one tissue in which expression
appears likely.

More recently, the pace of genomic sequencing has
accelerated dramatically. When genomic DNA serves as the
initial substrate for sequencing efforts, expression cannot
be presumed; often the only a priori biological information
about the sequence includes the species and chromosome (and
perhaps chromosomal map location) of origin.

With the ever-accelerating pace of sequence
accumulation by directed, EST, and genomic sequencing
approaches, and in particular with the accumulation of
sequence information from multiple genera, from multiple
species within genera, and from multiple individuals within
a species, there is an increasing need for methods that
rapidly and effectively permit the functions of nucleic
sequences to be elucidated. And as such functional
information accumulates, there is a further need for
methods of storing such functional information in
meaningful and useful relationship to the sequence itself;
that is, there is an increasing need for means and
apparatus for annotating raw sequence data with known or
predicted functional information.

Although the increase in the pace of genomic
sequencing is due in large part to technological changes in
sequencing strategies and instrumentation, Service, Science
280:995 (1998); Pennisi, Science 283:1822-1823 (1999),
there is an important functional motivation as well.

While it was understood that the EST approach
would rarely be able to yield sequence information about
the noncoding portions of the genome, it now also appears
the EST approach is capable of capturing only a fraction of
a genome's actual expression complexity.

For example, when the C. elegans genome was fully
sequenced, gene prediction algorithms identified over
19,000 potential genes, of which only 7,300 had been found
by EST sequencing. C. elegans Sequencing Consortium,
Science 282:2012 (1998). Analogously, the recently
completed sequence of chromosome 2 of Arabidopsis predicts
over 4000 genes, Lin et al., Nature 402:761 (1999), of
which only about 6% had previously been identified via EST
sequencing efforts. Although the human genome has the
greatest depth of EST coverage, it is still woefully short
of surrendering all of its genes. One recent estimate
suggests that the human genome contains more than 146,000
genes, which would at this point leave greater than half of
the genes undiscovered. It is now predicted that many
genes, perhaps 20 to 50%, will only be found by genomic
sequencing.

There is, therefore, a need for methods that
permit the functional regions of genomic sequence, and
most importantly, but not exclusively, regions that
function to encode genes, to be identified.

Much of the coding sequence of the human genome
is not homologous to known genes, making detection of open
reading frames ("ORFs") and predictions of gene function
difficult. Computational methods exist for predicting
coding regions in eukaryotic genomes. Gene prediction
programs such as GRAIL and GRAIL II, Uberbacher et al.,
Proc. Natl. Acad. Sci. USA 88(24):11261-5 (1991); Xu et
al., Genet. Eng. 16:241-53 (1994); Uberbacher et al.,
Methods Enzymol. 266:259-81 (1996); GENEFINDER, Solovyev et
al., Nucl. Acids Res. 22:5156-63 (1994); Solovyev et al.,
Ismb 5:234-302 (1997); and GENESCAN, Burge et al., J. Mol.
Biol. 261:71-94 (1997), predict many putative genes without
known homology or function. Such programs are known,
however, to give high false positive rates. Burset et al.,
Genomics 34:353-367 (1996). Using a consensus obtained by
a plurality of such programs is known to increase the
reliability of calling exons from genomic sequence.
Ansari-Lari et al., Genome Res. 8(1):29-40 (1998).

Identification of functional genes from genomic
data remains, however, an imperfect art. For example, in
reporting the full sequence of human chromosome 21, the
Chromosome 21 Mapping and Sequencing Consortium reports
that prior bioinformatic estimates of human gene number may
need to be revised substantially downwards. Nature
405:311-319 (2000); Reeves, Nature 405:283-284 (2000).

Thus, there is a need for methods and apparatus
that permit the functions of the regions identified
bioinformatically, and specifically, that permit the
expression of regions predicted to encode protein, readily
to be confirmed experimentally.

Recently, the development of nucleic acid
microarrays has made possible the automated and highly
parallel measurement of gene expression. Reviewed in
Schena (ed.), DNA Microarrays: A Practical Approach
(Practical Approach Series), Oxford University Press (1999)
(ISBN: 0199637768); Nature Genet. 21(1)(suppl):1-60
(1999); Schena (ed.), Microarray Biochip: Tools and
Technology, Eaton Publishing Company/BioTechniques Books
Division (2000) (ISBN: 1881299376).

It is common for microarrays to be derived from
cDNA/EST libraries, either from those previously described
in the literature, such as those from the I.M.A.G.E.
consortium, Lennon et al., Genomics 33(1):151-2 (1996), or
from the construction of "problem specific" libraries
targeted at a particular biological question, R.S. Thomas
et al., Cancer Res. (in press). Such microarrays by
definition can measure expression only of those genes found
in EST libraries, and thus have not been useful as probes
for genes discovered solely by genomic sequencing.

The utility of using whole genome nucleic acid
microarrays to answer certain biological questions has been
demonstrated for the yeast Saccharomyces cerevisiae. DeRisi
et al., Science 278:680 (1997). The vast majority of
yeast nuclear genes, approximately 95%, however, are single
exon genes, i.e., lack introns, Lopez et al., RNA
5:1125-1137 (1999); Goffeau et al., Science 274:563-67
(1996), permitting coding regions more readily to be
identified. Whole genome nucleic acid microarrays have not
generally been used to probe gene expression from more
complex eukaryotic genomes, and in particular from those
averaging more than one intron per gene.

Diseases of the heart and vascular system are a
significant cause of human morbidity and mortality.
Increasingly, genetic factors are being found that
contribute to predisposition, onset, and/or aggressiveness
of most, if not all, of these diseases. Although mutations
in single genes have on occasion been identified as
causative, these disorders are for the most part believed
to have polygenic etiologies. There is a need for methods
and apparatus that permit prediction, diagnosis and
prognosis of diseases of the human heart, particularly
those diseases with polygenic etiology.

Summary of the Invention 

The present invention solves these and other
problems in the art by providing methods and apparatus for
predicting, confirming, and displaying functional
information derived from genomic sequence. The present
invention also provides apparatus for verifying the
expression of putative genes identified within genomic
sequence.

In particular, the invention provides novel
genome-derived single exon nucleic acid microarrays useful
for verifying the expression of putative genes identified
within genomic sequence.

The present invention also provides compositions
and kits for the ready production of nucleic acids
identical in sequence to, or substantially identical in
sequence to, probes on the genome-derived single exon
microarrays of the present invention.

Accordingly, in a first aspect of the invention,
there is provided a spatially-addressable set of single
exon nucleic acid probes for measuring gene expression in a
sample derived from human heart, comprising a plurality of
single exon nucleic acid probes according to any one of the
nucleotide sequences set out in SEQ ID NOs: 1 - 9,980 or a
complementary sequence, or a portion of such a sequence.

By plurality is meant at least two, suitably at
least 20, most suitably at least 100, preferably at least
1000 and, most preferably, up to 5000.

In one embodiment of the first aspect, each of
said plurality of probes is separately and addressably
amplifiable.

In an alternative embodiment, each of said
plurality of probes is separately and addressably
isolatable from said plurality.

In a preferred embodiment, each of said plurality
of probes is amplifiable using at least one common primer.
Preferably, each of said plurality of probes is amplifiable
using a first and a second common primer.

In yet another embodiment, said set of single
exon nucleic acid probes comprises between 50 - 20,000
probes, for example, 50 - 5000.

Suitably, said set of single exon nucleic acid
probes comprises at least 50 - 1000 discrete single exon
nucleic acid probes having a sequence as set out in any of
SEQ ID NOS.: 1 - 19,771 or a complementary sequence, or a
portion of such a sequence.

Preferably, the average length of the single exon
nucleic acid probes is between 200 and 500 bp. It is
preferred that the average length should be at least 200bp,
suitably at least 250bp, most suitably at least 300bp,
preferably at least 400bp and, most preferably, 500 bp.

In another embodiment, the single exon nucleic
acid probes lack prokaryotic and bacteriophage vector
sequence. It is preferred that at least 50%, suitably at
least 60%, most suitably at least 70%, preferably at least
75%, more preferably at least 80, 85, 90, 95 or 99% of said
single exon nucleic acid probes lack prokaryotic and
bacteriophage vector sequence.

In another preferred embodiment, said single exon
nucleic acid probes lack homopolymeric stretches of A or T.
It is preferred that at least 50%, suitably at least 60%,
most suitably at least 70%, preferably at least 75%, more
preferably at least 80, 85, 90, 95 or 99% of said single
exon nucleic acid probes lack homopolymeric stretches of A
or T.
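
By way of illustration only, the percentage criterion above can be sketched in a few lines of Python. The text does not state how long a run of A or T must be to count as a "homopolymeric stretch"; the `min_run` default of 10 below, and the function names, are assumptions made for this sketch and are not taken from the specification.

```python
import re


def has_homopolymer_at(seq: str, min_run: int = 10) -> bool:
    """True if seq contains min_run or more consecutive A's, or consecutive T's.

    min_run is an assumed threshold; the specification does not define one.
    """
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_run, min_run))
    return pattern.search(seq.upper()) is not None


def fraction_lacking_homopolymer(probes, min_run: int = 10) -> float:
    """Fraction of probe sequences with no homopolymeric A or T stretch."""
    if not probes:
        return 0.0
    clean = sum(1 for p in probes if not has_homopolymer_at(p, min_run))
    return clean / len(probes)
```

A set of probes would satisfy, for example, the "at least 75%" preference when `fraction_lacking_homopolymer(probes) >= 0.75` under the chosen run length.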

Preferably, a spatially-addressable set of single
exon nucleic acid probes in accordance with the first
aspect of the invention is addressably disposed upon a
substrate.

Suitable substrates include a filter membrane
which may, preferably, be nitrocellulose or nylon. The
nylon may, preferably, be positively-charged. Other suitable
substrates include glass, amorphous silicon, crystalline
silicon, and plastic. Further suitable materials include
polymethylacrylic, polyethylene, polypropylene,
polyacrylate, polymethylmethacrylate, polyvinylchloride,
polytetrafluoroethylene, polystyrene, polycarbonate,
polyacetal, polysulfone, celluloseacetate,
cellulosenitrate, nitrocellulose, and mixtures thereof.

In a second aspect of the invention, there is
provided a microarray comprising a spatially addressable
set of single exon nucleic acid probes in accordance with
the first aspect of the invention.

In one embodiment, a genome-derived single-exon
microarray is packaged together with such an ordered set of
amplifiable probes corresponding to the probes, or one or
more subsets of probes, thereon. In alternative
embodiments, the ordered set of amplifiable probes is
packaged separately from the genome-derived single exon
microarray.

In another aspect, the invention provides
genome-derived single exon nucleic acid probes useful for gene
expression analysis, and particularly for gene expression
analysis by microarray. In particular embodiments of this
aspect, the present invention provides human single-exon
probes that include specifically-hybridizable fragments of
SEQ ID Nos. 9,981 - 19,771, wherein the fragment hybridizes
at high stringency to an expressed human gene. In
particular embodiments, the invention provides single exon
probes comprising SEQ ID Nos. 1 - 9,980.

Accordingly, in a third aspect of the invention,
there is provided a single exon nucleic acid probe for
measuring human gene expression in a sample derived from
human heart which is a nucleic acid molecule comprising a
nucleotide sequence as set out in any of SEQ ID NOs.: 1 -
9,980 or a complementary sequence or a fragment thereof
wherein said probe hybridizes at high stringency to a
nucleic acid expressed in the human heart.

In one embodiment, a single exon nucleic acid
probe in accordance with the third aspect comprises a
nucleotide sequence as set out in any of SEQ ID NOs.: 9,981
- 19,771 or a complementary sequence or a fragment thereof.

In a fourth aspect of the invention, there is
provided a single exon nucleic acid probe for measuring
human gene expression in a sample derived from human heart
which is a nucleic acid molecule having a sequence encoding
a peptide comprising a peptide sequence as set out in any
of SEQ ID NOs.: 19,772 - 29,119 or a complementary sequence
or a fragment thereof wherein said probe hybridizes at high
stringency to a nucleic acid expressed in the human heart.

Preferably, a single exon nucleic acid probe in
accordance with the third or fourth aspects of the
invention comprises between at least 15 and 50 contiguous
nucleotides of said SEQ ID NO. It is preferred that the
single exon nucleic acid probe comprises at least 15,
suitably at least 20, more suitably at least 25 or
preferably at least 50 contiguous nucleotides of said SEQ
ID NO.

In another preferred embodiment, a single exon
nucleic acid probe in accordance with the third or fourth
aspects of the invention is between 3kb and 25kb in length.
It is preferred that said probe is no more than 3kb,
suitably no more than 5kb, more suitably no more than 10kb,
preferably 15kb, more preferably 20kb or, most preferably,
no more than 20kb in length.

Preferably, a single exon nucleic acid probe in
accordance with either the fifth or sixth aspect of the
invention is DNA, preferably single-stranded DNA, RNA or
PNA.

In another embodiment of either the third or
fourth aspect of the invention, a single exon nucleic acid
probe is detectably labeled. Suitable detectable labels
include a radionuclide, a fluorescent label or a first
member of a specific binding pair. Suitable fluorescent
labels include dyes such as cyanine dyes, preferably Cy3
and Cy5, although other suitable dyes will be known to those
skilled in the art.

In a particularly preferred embodiment, a single
exon nucleic acid probe in accordance with either the third
or fourth aspect of the invention lacks prokaryotic and
bacteriophage vector sequence. In yet another embodiment, a
single exon nucleic acid probe in accordance with either
the third or fourth aspect of the invention lacks
homopolymeric stretches of A or T.

In a fifth aspect of the invention, there is
provided an amplifiable nucleic acid composition
comprising:

the single exon nucleic acid probe in accordance
with either of the third or fourth aspects of the
invention; and at least one nucleic acid primer;

wherein said at least one primer is sufficient to
prime enzymatic amplification of said probe.

In a sixth aspect of the invention, there is
provided a method of measuring gene expression in a sample
derived from human heart, comprising:

contacting the single exon microarray in
accordance with the second aspect of the invention, with a
first collection of detectably labeled nucleic acids, said
first collection of nucleic acids derived from mRNA of
human heart; and then

measuring the label detectably bound to each
probe of said microarray.

In a seventh aspect of the invention, there is
provided a method of identifying exons in a eukaryotic
genome, comprising:

algorithmically predicting at least one exon from
genomic sequence of said eukaryote; and then

detecting specific hybridization of detectably
labeled nucleic acids to a single exon probe,

wherein said detectably labeled nucleic acids are
derived from mRNA from the heart of said eukaryote, said
probe is a single exon probe having a fragment identical in
sequence to, or complementary in sequence to, said
predicted exon, said probe is included within a single exon
microarray in accordance with the first aspect of the
invention, and said fragment is selectively hybridizable at
high stringency.

In an eighth aspect of the invention, there is
provided a method of assigning exons to a single gene,
comprising:

identifying a plurality of exons from genomic
sequence in accordance with the seventh aspect of the
invention; and then

measuring the expression of each of said exons in
a plurality of tissues and/or cell types using
hybridization to single exon microarrays having a probe
with said exon,

wherein a common pattern of expression of said
exons in said plurality of tissues and/or cell types
indicates that the exons should be assigned to a single
gene.
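
One way to make "a common pattern of expression" concrete is sketched below, purely as an illustration. The text does not specify a similarity measure or a cutoff; Pearson correlation across tissues, the 0.9 threshold, the greedy grouping rule, and all names here are assumptions chosen for the sketch.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat profile carries no pattern to compare
    return cov / sqrt(vx * vy)


def group_exons(profiles, threshold=0.9):
    """Greedily group exons whose per-tissue profiles correlate.

    profiles maps an exon name to a list of expression values, one per
    tissue, in the same tissue order for every exon. An exon joins the
    first group whose first member it correlates with at or above the
    (assumed) threshold; otherwise it starts a new group.
    """
    groups = []
    for name, profile in profiles.items():
        for group in groups:
            if pearson(profiles[group[0]], profile) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups
```

Each resulting group is a candidate set of exons for assignment to a single gene; genomic proximity of the exons, which this sketch ignores, would normally be considered as well.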

In a ninth aspect of the invention, there is
provided a nucleic acid sequence as set out in any of SEQ
ID NOs: 1 - 19,771 wherein said sequence encodes a peptide.

In a tenth aspect of the invention, there is
provided a peptide encoded by a sequence comprising a
sequence as set out in any of SEQ ID NOs: 9,981 - 19,771,
or a complementary sequence or coding portion thereof.

In a preferred embodiment, a peptide may be
encoded by a sequence comprising a sequence set out in any
of SEQ ID NOS.: 1 - 9,980.

In a further aspect, the invention provides
peptides comprising an amino acid sequence translated from
the DNA fragments, said amino acid sequences comprising SEQ
ID NOS.: 9,981 - 19,771.

Accordingly, in an eleventh aspect of the invention
there is provided a peptide comprising a sequence as set
out in any of SEQ ID NOs: 19,772 - 29,119, or fragment
thereof.

In another aspect, the invention provides means
for displaying annotated sequence, and in particular, for
displaying sequence annotated according to the methods and
apparatus of the present invention. Further, such display
can be used as a preferred graphical user interface for
electronic search, query, and analysis of such annotated
sequence.



Detailed Description of the Invention 
Definitions 

As used herein, the term "microarray" and phrase
"nucleic acid microarray" refer to a substrate-bound
collection of plural nucleic acids, hybridization to each
of the plurality of bound nucleic acids being separately
detectable. The substrate can be solid or porous, planar
or non-planar, unitary or distributed.

As so defined, the term "microarray" and phrase
"nucleic acid microarray" include all the devices so called
in Schena (ed.), DNA Microarrays: A Practical Approach
(Practical Approach Series), Oxford University Press (1999)
(ISBN: 0199637768); Nature Genet. 21(1)(suppl):1-60
(1999); and Schena (ed.), Microarray Biochip: Tools and
Technology, Eaton Publishing Company/BioTechniques Books
Division (2000) (ISBN: 1881299376). As so defined, the
term "microarray" and phrase "nucleic acid microarray"
further include substrate-bound collections of plural
nucleic acids in which the nucleic acids are distributably
disposed on a plurality of beads, rather than on a unitary
planar substrate, as is described, inter alia, in Brenner
et al., Proc. Natl. Acad. Sci. USA 97(4):1665-1670 (2000);
in such case, the term "microarray" and phrase "nucleic
acid microarray" refer to the plurality of beads in
aggregate.

As used herein with respect to a nucleic acid
microarray, the term "probe" refers to the nucleic acid
that is, or is intended to be, bound to the substrate; in
such context, the term "target" thus refers to nucleic acid
intended to be bound thereto by Watson-Crick
complementarity. As used herein with respect to solution
phase hybridization, the term "probe" refers to the nucleic
acid of known sequence that is detectably labeled.

As used herein, the expression "probe comprising
SEQ ID NO.", and variants thereof, intends a nucleic acid
probe, at least a portion of which probe has either (i) the
sequence directly as given in the referenced SEQ ID NO., or
(ii) a sequence complementary to the sequence as given in
the referenced SEQ ID NO., the choice as between sequence
directly as given and complement thereof dictated by the
requirement that the probe hybridize to mRNA.
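
The choice between the sequence "as given" and its complement can be illustrated with a short sketch. This is a simplification under stated assumptions: it treats hybridization as exact Watson-Crick pairing over the whole listed sequence, it writes the message in the DNA alphabet (U replaced by T), and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def probe_orientation(listed_sequence: str, mrna: str):
    """Decide which strand of a listed sequence would pair with an mRNA.

    Returns "as given" if the listed sequence itself is antisense to the
    message, "complement" if the listed sequence is the sense strand (so
    its complement is what hybridizes), or None if neither strand pairs.
    Exact matching only; mismatches and stringency are not modelled.
    """
    message = mrna.upper().replace("U", "T")
    listed = listed_sequence.upper()
    # A strand pairs with the message when its reverse complement
    # occurs within the message.
    if reverse_complement(listed) in message:
        return "as given"
    if listed in message:
        return "complement"
    return None
```

For a listed sequence that reads the same way as the message, `probe_orientation` returns `"complement"`, and `reverse_complement(listed_sequence)` gives the strand that would be used as probe.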

As used herein, the term "open reading frame" and
the equivalent acronym "ORF" refer to that portion of an
exon that can be translated in its entirety into a sequence
of contiguous amino acids, i.e. a nucleic acid sequence
that, in at least one reading frame, does not possess stop
codons; the term does not require that the ORF encode the
entirety of a natural protein.
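
The definition above lends itself to a direct check. The following is a minimal sketch, assuming the standard genetic code's three stop codons (TAA, TAG, TGA) and considering only the three reading frames of the strand as given; the function names are illustrative and not part of the specification.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet


def open_frames(seq: str):
    """Return the forward reading frames (0, 1, 2) free of stop codons.

    Only complete codons are examined; trailing bases that do not fill a
    codon are ignored. The reverse strand is not considered here.
    """
    s = seq.upper()
    frames = []
    for frame in range(3):
        codons = [s[i:i + 3] for i in range(frame, len(s) - 2, 3)]
        if codons and not any(c in STOP_CODONS for c in codons):
            frames.append(frame)
    return frames


def is_orf(seq: str) -> bool:
    """True if at least one forward reading frame has no stop codon."""
    return bool(open_frames(seq))
```

Consistent with the definition, the check does not require a start codon or a complete protein; it only asks whether some frame reads through without interruption.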

As used herein, the term "amplicon" refers to a 
PCR product amplified from human genomic DNA, containing 
the predicted exon. 

As used herein, the term "exon" refers to the
consensus prediction of the various exon and gene
predicting algorithms, i.e. a nucleic acid sequence
bioinformatically predicted to encode a portion of a
natural protein.
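
A consensus of this kind can be sketched as a simple vote over predicted intervals. The text does not say how the consensus is formed; the rule used below (keep every stretch supported by at least `min_votes` programs), the half-open coordinate convention, and the names are assumptions for illustration only. Each program's own intervals are assumed to be non-empty and non-overlapping.

```python
def consensus_exons(predictions, min_votes=2):
    """Return merged intervals predicted by at least min_votes programs.

    predictions maps a program label to a list of (start, end) half-open
    intervals on the same genomic coordinate system.
    """
    events = []
    for intervals in predictions.values():
        for start, end in intervals:
            events.append((start, 1))   # a prediction begins
            events.append((end, -1))    # a prediction ends
    # Tuples sort ends (-1) before starts (+1) at the same position, so
    # intervals that merely abut are not counted as overlapping.
    events.sort()

    result, depth, open_at = [], 0, None
    for pos, delta in events:
        previous = depth
        depth += delta
        if previous < min_votes <= depth:
            open_at = pos
        elif depth < min_votes <= previous:
            result.append((open_at, pos))
    return result
```

Raising `min_votes` trades sensitivity for specificity: requiring agreement from every program keeps only the core region on which all predictions overlap.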

As used herein, the term "peptide" refers to a
sequence of amino acids. The sequences referred to as
PEPTIDE SEQ ID NOS. are the predicted peptide sequences
that would be translated from one of the exons, or a
portion thereof, set out in exon SEQ ID NOS. The codons
encoding the peptide are wholly contained within one exon.

As used herein, a "portion" of a defined
nucleotide sequence or sequences can be and, preferably,
are fragments unique to that sequence or to one or a
combination of those sequences. A fragment unique to a
nucleic acid molecule is one that is a signature for the
larger nucleic acid molecule.

As used herein, the phrase "expression of a
probe" and its linguistic variants means that the ORF
present within the probe, or its complement, is present
within a target mRNA.

As used herein, "stringent conditions" refers to
parameters well known to those skilled in the art. When a
nucleic acid molecule is said to be hybridisable to another
of a given sequence under "stringent conditions" it is
meant that it is homologous to the given sequence.

As used herein, the phrase "specific binding
pair" intends a pair of molecules that bind to one another
with high specificity. Binding pairs are said to exhibit
specific binding when they exhibit avidity of at least 10^7,
preferably at least 10^8, more preferably at least 10^9
liters/mole. Nonlimiting examples of specific binding
pairs are: antibody and antigen; biotin and avidin; and
biotin and streptavidin.

As used herein with respect to the visual display
[...]
any geometric shape that has at least a first and a second
border, wherein the first and second borders each are
capable of mapping uniquely to a point of another visual
object of the display.

As used herein, a "Mondrian" means a visual
display in which a single genomic sequence is annotated
[...]

Brief Description of the Drawings

The present invention is further illustrated with 
reference to the following non-limiting figures and 
examples in which: 

FIG. 1 illustrates a process for predicting
functional regions from genomic sequence, confirming the
functional activity of such regions experimentally, and
associating and displaying the data so obtained in
meaningful and useful relationship to the original sequence
data;

FIG. 2 further elaborates that portion of the
process schematized in FIG. 1 for predicting functional
regions from genomic sequence;

FIG. 3 illustrates a Mondrian visual display;

FIG. 4 presents a Mondrian showing a hypothetical
annotated genomic sequence;

FIG. 5 is a histogram showing the distribution of
ORF length and PCR products as obtained, with ORF length
shown in black and PCR product length shown in dotted
lines;

FIG. 6 is a histogram showing the distribution,
among exons predicted according to the methods described,
of expression as measured using simultaneous two color
hybridization to a genome-derived single exon microarray.
The graph shows the number of sequence-verified products
that were either not expressed ("0"), expressed in one or
more but not all tested tissues ("1" - "9"), or expressed
in all tissues tested ("10");

FIG. 7 is a pictorial representation of the
expression of verified sequences that showed expression
with signal intensity greater than 3 in at least one
tissue, with FIG. 7A showing the expression as measured by
microarray hybridization in each of the 10 measured
tissues, and the expression as measured "bioinformatically"
by query of EST, NR and SwissProt databases; with FIG. 7B
showing the legend for display of physical expression
(ratio) in FIG. 7A; and with FIG. 7C showing the legend for
scoring EST hits as depicted in FIG. 7A;

FIG. 8 shows a comparison of normalized Cy3
signal intensity for arrayed sequences that were identical
to sequences in existing EST, NR and SwissProt databases or
that were dissimilar (unknown), where black denotes the
signal intensity for all sequence-verified products with a
BLAST Expect ("E") value of greater than 1e-30 (1 x 10^-30)
("unknown") and a dotted line denotes sequence-verified
spots with a BLAST Expect ("E") value of less than 1e-30 (1
x 10^-30) ("known");

FIG. 9 presents a Mondrian of BAC AC003172 (bases
25,000 to 130,000), containing the carbamyl phosphate
synthetase gene (AF154830.1); and

FIG. 10 is a Mondrian of BAC A049839.



Methods and Apparatus for Predicting, Confirming,
Annotating, and Displaying Functional Regions From Genomic
Sequence Data

FIG. 1 is a flow chart illustrating in broad
outline a process for predicting functional regions from
genomic sequence, confirming and characterizing the
functional activity of such regions experimentally, and
then associating and displaying the information so obtained
in meaningful and useful relationship to the original
sequence data.

The initial input into process 10 of the present
invention is drawn from one or more databases 100
containing genomic sequence data. Because genomic sequence
is usually obtained from subgenomic fragments, the sequence
data typically will be stored in a series of records
corresponding to these subgenomic sequenced fragments.
Some fragments will have been catenated to form larger
contiguous sequences ("contigs"); others will not. A
finite percentage of sequence data in the database will
typically be erroneous, consisting inter alia of vector
sequence, sequence created from aberrant cloning events,
sequence of artificial polylinkers, and sequence that was
erroneously read.

Each sequence record in database 100 will
minimally contain as annotation a unique sequence
identifier (accession number), and will typically be
annotated further to identify the date of accession,
species of origin, and depositor. Because database 100 can
contain nongenomic sequence, each sequence will typically
be annotated further to permit query for genomic sequence.
Chromosomal origin, optionally with map location, can also
be present. Data can be, and over time increasingly will
be, further annotated with additional information, in part
through use of the present invention, as described below.
Annotation can be present within the data records, in
information external to database 100 and linked to the
records thereto, or through a combination of the two.

Databases useful as genomic sequence database 100
in the present invention include GenBank, and particularly
include several divisions thereof, including the
htgs (draft), NT (nucleotide, command line), and NR
(nonredundant) divisions. GenBank is produced by the
National Institutes of Health and is maintained by the
National Center for Biotechnology Information (NCBI).
Databases of genomic sequence from species other than
human, such as mouse, rat, Arabidopsis, C. elegans, C.
briggsae, Drosophila, zebra fish, and other higher
eukaryotic organisms will also prove useful as genomic
sequence database 100.

Genomic sequence obtained by query of genomic
sequence database 100 is then input into one or more
processes 200 for identification of regions therein that
are predicted to have a biological function as specified by
the user. Such functions include, but are not limited to,
encoding protein, regulating transcription, regulating
message transport after transcription into mRNA, regulating
message splicing after transcription into mRNA, or
regulating message degradation after transcription into
mRNA, and the like. Other functions include directing
somatic recombination events, contributing to chromosomal
stability or movement, contributing to allelic exclusion or
X chromosome inactivation, and the like.

The particular genomic sequence to be input into
process 200 will depend upon the function for which
relevant sequence is to be identified as well as upon the
approach chosen for such identification. Process step 200
can be iterated to identify different functions within a
given genomic region. In such case, the input often will
be different for the several iterations.

Sequences predicted to have the requisite
function by process 200 are then input into process 300,
where a subset of the input sequences suitable for
experimental confirmation is identified. Experimental
confirmation can involve physical and/or bioinformatic
assay. Where the subsequent experimental assay is
bioinformatic, rather than physical, there are fewer
constraints on the sequences that can be tested, and in
this latter case therefore process 300 can output the
entirety of the input sequence.

The subset of sequences output from process 300
is then used in process 400 for experimental verification
and characterization of the function predicted in
process 200, which experimental verification can, and often
will, include both physical and bioinformatic assay.

Process 500 annotates the sequence data with the
functional information obtained in the physical and/or
bioinformatic assays of process 400. Such annotation can
be done using any technique that usefully relates the
functional information to the sequence, as, for example, by
incorporating the functional data into the sequence data
record itself, by linking records in a hierarchical or
relational database, by linking to external databases, by a
combination thereof, or by other means well known within
the database arts. The data can even be submitted for
incorporation into databases maintained by others, such as
GenBank, which is maintained by NCBI.

As further noted in FIG. 1, additional annotation 
can be input into process 500 from external sources 600. 

The annotated data is then displayed in process
800, either before, concomitantly with, or after optional
storage 700 on nontransient media, such as magnetic disk,
optical disc, magnetooptical disk, flash memory, or the
like.

FIG. 1 shows that the experimental data output
from process 400 can be used in each preceding step of
process 10: e.g., facilitating identification of functional
sequences in process 200, facilitating identification of an
experimentally suitable subset thereof in process 300, and
facilitating creation of physical and/or informational
substrates for assay of functional sequences in process 400.

Information from each step can be passed directly 
to the succeeding process, or stored in permanent or 
interim form prior to passage to the succeeding process. 
Often, data will be stored after each, or at least a 
plurality, of such process steps. Any or all process steps 

functional sequence within genomic sequence according to
process 200.

for genomic sequence. 

The sequence required to be returned by query 20
will depend, in the first instance, upon the function to be
identified.

For example, genomic sequences that function to
encode protein can be identified inter alia using gene
prediction approaches, comparative sequence analysis
approaches, or combinations of the two. In gene prediction
analysis, sequence from one genome is input into process
200 where at least one, preferably a plurality, of
algorithmic methods are applied to identify putative coding
regions. In comparative sequence analysis, by contrast,
corresponding, e.g., syntenic, sequence from a plurality of
sources, typically a plurality of species, is input into
process 200, where at least one, possibly a plurality, of
algorithmic methods are applied to compare the sequences
and identify regions of least variability.

The exact content of query 20 will also depend
upon the database queried. For example, if the database
contains both genomic and nongenomic sequence, perhaps
derived from multiple species, and the function to be
determined is protein coding regions in human genomic
sequence, the query will accordingly require that the
sequence returned be genomic and derived from humans.

Query 20 can also incorporate criteria that
compel return of sequence that meets operative requirements
of the subsequent analytical method. Alternatively, or in
addition, such operative criteria can be enforced in
subsequent preprocess step 24.

For example, if the function sought to be
identified is protein coding, query 20 can incorporate
criteria that require return from genomic sequence database
100 of only those sequences present within contigs
sufficiently long as to have obviated substantial
fragmentation of any given exon among a plurality of
separate sequence fragments.

Such criteria can, for example, consist of a
required minimal individual genomic sequence fragment
length, such as 10 kb, more typically 20 kb, 30 kb, 40 kb,
and preferably 50 kb or more, as well as an optional
further or alternative requirement that sequence from any
given clone, such as a bacterial artificial chromosome
("BAC"), be presented in no more than a finite maximal
number of fragments, such as no more than 20 separate
pieces, more typically no more than 15 fragments, even more
typically no more than about 10 - 12 fragments.

Results using the present invention have shown
that genomic sequence from bacterial artificial chromosomes
(BACs) is sufficient for gene prediction analysis according
to the present invention if the sequence is at least 50 kb
in length, and if additionally the sequence from any given
BAC is presented in fewer than 15, and preferably fewer
than 10, fragments. Accordingly, query 20 can incorporate
a requirement that data accessioned from BAC sequencing be
in fewer than 15, preferably fewer than 10, fragments.
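
By way of illustration only, the length and fragmentation criteria described above can be sketched in Python as follows. The record structure, the function name passes_query, and the reading of "at least 50 kb" as the summed length of all fragments of a clone are assumptions made for this sketch; they are not taken from the Examples.

```python
from dataclasses import dataclass


@dataclass
class BacRecord:
    accession: str          # unique sequence identifier
    fragment_lengths: list  # length in bp of each sequenced fragment


def passes_query(record, min_total_bp=50_000, max_fragments=15):
    """Return True if the clone is long enough and in few enough pieces.

    Assumption: the 50 kb threshold applies to the summed fragment length,
    and the clone must be presented in fewer than `max_fragments` fragments.
    """
    total = sum(record.fragment_lengths)
    return total >= min_total_bp and len(record.fragment_lengths) < max_fragments


# Hypothetical records, for illustration only.
records = [
    BacRecord("BAC_A", [60_000, 20_000]),   # long, 2 fragments
    BacRecord("BAC_B", [3_000] * 20),       # 60 kb total but 20 fragments
    BacRecord("BAC_C", [10_000, 15_000]),   # only 25 kb total
]
selected = [r.accession for r in records if passes_query(r)]
```

Passing max_fragments=10 would apply the stricter, preferred limit of fewer than 10 fragments.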

An additional criterion that can be incorporated
into the query can be the date, or range of dates, of
sequence accession. Although the process has been
described above as if genomic sequence database 100 were
static, it is of course understood that the genomic
sequence databases need not be static, and indeed are
typically updated on a frequent, even hourly, basis. Thus,
as further described in Examples 1 and 2, infra, it is
possible to query the database for newly added sequence,
either newly added after an absolute date, or newly added
relative to a prior analysis performed using the methods
and apparatus of the present invention. In this way, the
process herein described can incorporate a dynamic,
temporal component.

One utility of such temporal limitation is to
identify, from newly accessioned genomic sequence, the
presence of novel genes, particularly those not previously
identified by EST sequencing (or other sequencing efforts
that are similarly based upon gene expression). As further
described in Example 1, such an approach has shown that
newly accessioned human genomic sequence, when analyzed for
sequences that function to encode protein, readily
identifies genes that are novel over those in existing EST
and other expression databases. This makes the methods of
the present invention extremely powerful gene discovery
tools. And as would be appreciated, such gene discovery
can be performed using genomic sequence from species other
than human.
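
A query limited to newly accessioned sequence can be sketched as follows, purely by way of example. The (accession, date) tuples and the function name newly_accessioned are hypothetical; an actual query would be expressed in the query language of the database concerned.

```python
from datetime import date


def newly_accessioned(records, since):
    """Return accessions whose accession date is strictly later than `since`.

    `since` can be an absolute date or the date of a prior analysis.
    """
    return [accession for accession, accessioned in records if accessioned > since]


# Hypothetical records, for illustration only.
records = [
    ("SEQ_1", date(2000, 1, 10)),
    ("SEQ_2", date(2000, 3, 2)),
    ("SEQ_3", date(2000, 3, 20)),
]
last_analysis = date(2000, 3, 1)
new_records = newly_accessioned(records, last_analysis)
```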

If query 20 incorporates multiple criteria, such
as above-described, the multiple criteria can be performed
as a series of separate queries or as a single query,
depending in part upon the query language, the complexity
of the query, and other considerations well known in the
database arts.

If query 20 returns no genomic sequence meeting
the query criteria, the negative result can be reported by
process 22, and process 200 (and indeed, entire process 10)
ended 23, as shown. Alternatively, or in addition to
report and termination of the initial inquiry, a new query
20 can be generated that takes into account the initial
negative result.

When query 20 returns sequence meeting the query
criteria, the returned sequence is then passed to optional
preprocessing 24, suitable and specific for the desired
analytical approach and the particular analytical methods
thereof to be used in process 25.

Preprocessing 24 can include processes suitable
for many approaches and methods thereof, as well as
processes specifically suited for the intended subsequent
analysis.

Preprocessing 24 suitable for most approaches and
methods will include elimination of sequence irrelevant to,
or that would interfere with, the subsequent analysis.
Such sequence includes repetitive sequence, such as Alu
repeats and LINE elements, vector sequence, artificial
sequence, such as artificial polylinkers, and the like.
Such removal can readily be performed by identification and
subsequent masking of the undesired sequence.

Identification can be effected by comparing the
genomic sequence returned by query 20 with public or
private databases containing known repetitive sequence,
vector sequence, artificial sequence, and other artifactual
sequence. Such comparison can readily be done using
programs well known in the art, such as CROSS_MATCH, or by
proprietary sequence comparison programs the engineering of
which is well within the skill in the art.

Alternatively, or in addition, undesirable,
including artifactual, sequence can be identified
algorithmically without comparison to external databases
and thereafter removed. For example, synthetic polylinker
sequence can be identified by an algorithm that identifies
a significantly higher than average density of known
restriction sites. As another example, vector sequence can
be identified by algorithms that identify nucleotide or
codon usage at variance with that of the bulk of the
genomic sequence.
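
The restriction-site density approach can be illustrated with the following minimal sketch. The particular enzymes, the window size, the step, and the five-fold threshold are all assumptions chosen for the illustration, and polylinker_windows is a hypothetical name; the sketch is not the algorithm actually used in the Examples.

```python
# Recognition sequences of six common restriction enzymes (EcoRI, BamHI,
# HindIII, PstI, SalI, XbaI); the choice of enzymes is an assumption.
SITES = ("GAATTC", "GGATCC", "AAGCTT", "CTGCAG", "GTCGAC", "TCTAGA")


def site_count(seq):
    """Count non-overlapping occurrences of each recognition site."""
    seq = seq.upper()
    return sum(seq.count(site) for site in SITES)


def polylinker_windows(seq, window=60, step=10, fold=5.0):
    """Return (start, end) windows whose site density exceeds `fold` times
    the density over the whole sequence."""
    overall = site_count(seq) / max(len(seq), 1)
    if overall == 0:
        return []  # no known sites anywhere, nothing to flag
    hits = []
    for start in range(0, max(len(seq) - window + 1, 1), step):
        chunk = seq[start:start + window]
        if site_count(chunk) / len(chunk) > fold * overall:
            hits.append((start, start + len(chunk)))
    return hits


# Toy input: a 36 nt run of back-to-back sites embedded in site-free sequence.
polylinker = "GAATTCGGATCCAAGCTTCTGCAGGTCGACTCTAGA"
genomic = "ACGT" * 50 + polylinker + "ACGT" * 50
hits = polylinker_windows(genomic)
```

The flagged windows can then be passed to whatever masking step is used downstream.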

Once identified, undesired sequence can be
removed. Removal can usefully be done by masking the
undesired sequence as, for example, by converting the
specific nucleotide references to one that is unrecognized
by the subsequent bioinformatic algorithms, such as "X".
Alternatively, but at present less preferred, the undesired
sequence can be excised from the returned genomic sequence,
leaving gaps.
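
Masking of this kind can be sketched as follows. The half-open (start, end) interval convention and the function name mask_intervals are assumptions of the sketch; masking, unlike excision, preserves the coordinates of the remaining sequence.

```python
def mask_intervals(seq, intervals, mask_char="X"):
    """Replace each half-open [start, end) interval of `seq` with `mask_char`.

    The length of the sequence is preserved, so downstream coordinates
    remain valid.
    """
    chars = list(seq)
    for start, end in intervals:
        start, end = max(start, 0), min(end, len(chars))
        for i in range(start, end):
            chars[i] = mask_char
    return "".join(chars)


masked = mask_intervals("ACGTACGTACGT", [(2, 5), (8, 10)])
```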

Preprocessing 24 can further include selection
from among duplicative sequences of that one sequence of
highest quality. Higher quality can be measured as a lower
percentage of, fewest number of, or least densely clustered
occurrence of ambiguous nucleotides, defined as those
nucleotides that are identified in the genomic sequence
using symbols indicating ambiguity. Higher quality can
also or alternatively be valued by presence in the longest
contig.
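
One of the quality measures just described, the percentage of ambiguous nucleotides, can be sketched as follows. Treating every symbol other than A, C, G or T as ambiguous, and breaking ties in favor of the longer sequence, are assumptions of this illustration.

```python
def ambiguous_fraction(seq):
    """Fraction of positions that are not an unambiguous A, C, G or T."""
    seq = seq.upper()
    if not seq:
        return 1.0  # treat an empty sequence as lowest quality
    return sum(base not in "ACGT" for base in seq) / len(seq)


def best_duplicate(candidates):
    """Pick the (identifier, sequence) pair with the lowest ambiguous
    fraction; ties go to the longer sequence."""
    return min(candidates,
               key=lambda item: (ambiguous_fraction(item[1]), -len(item[1])))


# Hypothetical duplicates, for illustration only.
best = best_duplicate([
    ("dup1", "ACGTNNACGT"),   # 2 of 10 positions ambiguous
    ("dup2", "ACGTACGTAC"),   # no ambiguous positions
    ("dup3", "ACGTRCGT"),     # 1 of 8 positions ambiguous
])
```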

Preprocessing 24 can, and often will, also
include formatting of the data as specifically appropriate
for passage to the analytical algorithms of process 25.
Such formatting can and typically will include, inter alia,
addition of a unique sequence identifier, either derived
from the original accession number in genomic sequence
database 100, or newly applied, and can further include
additional annotation. Formatting can include conversion
from one to another sequence listing standard, such as
conversion to or from FASTA or the like, depending upon the
input expected by the subsequent process.
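
Conversion to FASTA format can be sketched as follows. The function name to_fasta and the default line width of 60 are choices made for this illustration only.

```python
def to_fasta(identifier, seq, description="", width=60):
    """Format one sequence as a FASTA record with a unique identifier."""
    header = f">{identifier} {description}".rstrip()
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines) + "\n"


# An 80 nt toy sequence wrapped at 30 characters per line.
fasta = to_fasta("SEQ_1", "ACGT" * 20, width=30)
```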

Preprocessing, which can be optional depending
upon the function desired to be identified and the
informational requirements of the methods for effecting
such identification, is followed by sequence processing 25,
where sequences with the desired function are identified
within the genomic sequence.

As mentioned above, such functions can include,
but are not limited to, encoding protein, regulating
transcription, regulating message transport after
transcription into mRNA, regulating message splicing after
transcription, or regulating message degradation, and the
like. Other functions include directing somatic
recombination events, contributing to chromosomal stability
or movement, contributing to allelic exclusion or
X chromosome inactivation, or the like.

PCT/US01/00666



The methods of the present invention are
particularly useful for gene discovery, that is, for
identifying, from genomic sequence, regions that function
to encode genes, and in a particularly useful embodiment,
for identifying regions that function to encode genes not
hitherto identified by expression-based or directed cloning
and sequencing. In conjunction with verification using the
novel single exon microarrays of the present invention, as
further described below, the methods herein described
become powerful gene discovery tools.



Accordingly, in a preferred embodiment of the
present invention, process 25 is used to identify putative
coding regions. Two preferred approaches in process 25 for
identifying sequence that encodes putative genes are gene
prediction and comparative sequence analysis.

Gene prediction can be performed using any of a
number of algorithmic methods, embodied in one or more
software programs, that identify open reading frames (ORFs)
using a variety of heuristics, such as GRAIL, DICTION, and
GENEFINDER. Comparative sequence analysis similarly can be
performed using any of a variety of known programs that
identify regions with lower sequence variability.

As further described in Example 1, below, gene
finding software programs yield a range of results. For
the newly accessioned human genomic sequence input in
Example 1, GRAIL identified the greatest
percentage of genomic sequence as putative coding region,
2% of the data analyzed; GENEFINDER was second, calling 1%;
and DICTION yielded the least putative coding region, with
0.8% of genomic sequence called as coding region.

Increased reliability can be obtained when
consensus is required among several such methods. Although
described here with respect to identifying coding regions,
consensus among methods will in general increase
reliability of predicting other functions as well.
Thus, as indicated by query 26, sequence
processing 25, optionally with preprocessing 24, can be
repeated with a different method, with consensus among such
iterations determined and reported in process 27.

Process 27 compares the several outputs for a
given input genomic sequence and identifies consensus among
the separately reported results. The consensus itself, as
well as the sequence meeting that consensus, is then stored
in process 29a, displayed in process 29b, and/or output to
process 300 for subsequent identification of a subset
thereof suitable for assay.

Multiple levels of consensus can be calculated
and reported by process 27. For example, as further
described in Example 1, infra, process 27 can report
consensus as between all specific pairs of methods of gene
prediction, as consensus among any one or more of the pairs
of methods of gene prediction, or as among all of the gene
prediction algorithms used. Thus, in Example 1, process 27
reported that GRAIL and GENEFINDER programs agreed on 0.7%
of genomic sequence, that GRAIL and DICTION agreed on 0.5%
of genomic sequence, and that the three programs together
agreed on 0.25% of the data analyzed. Put another way,
0.25% of the genomic sequence was identified by all three
of the programs as containing putative coding region.
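
The calculation of pairwise and overall consensus can be sketched as follows. The interval coordinates are invented toy data, not the data of Example 1, and the representation of each program's calls as half-open (start, end) intervals is an assumption of the sketch.

```python
def covered_positions(intervals):
    """Expand half-open (start, end) calls into a set of base positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def consensus_fraction(calls, programs, sequence_length):
    """Fraction of the sequence called as coding by every listed program."""
    shared = set.intersection(*(covered_positions(calls[p]) for p in programs))
    return len(shared) / sequence_length


# Hypothetical exon calls on a 10,000 bp sequence.
calls = {
    "GRAIL": [(100, 300), (500, 700)],
    "GENEFINDER": [(150, 320)],
    "DICTION": [(200, 260), (650, 690)],
}
pair = consensus_fraction(calls, ["GRAIL", "GENEFINDER"], 10_000)
all_three = consensus_fraction(calls, list(calls), 10_000)
```

A set of positions is simple but memory-hungry; for sequence of BAC length an interval-intersection routine would be the more economical choice.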

Furthermore, consensus can be required among
different approaches to identifying a chosen function.

For example, if the function desired to be
identified is coding of protein sequence, and a first used
approach to exon calling is gene prediction, the process
can be repeated on the same input sequence, or subset
thereof, with another approach, such as comparative
sequence analysis. In such a case, where comparative
sequence analysis follows gene prediction, the comparison
can be performed not only on genomic nucleic acid sequence,
but additionally or alternatively can be performed on the
predicted amino acid sequence translated from the ORFs
prior identified by the gene prediction approach.

Although shown as an iterative process, the
multiple analyses required to achieve consensus can be done
in series, in parallel, or some combination thereof.

Predicted functional sequence, optionally
representing a consensus among a plurality of methods and
approaches for determination thereof, is passed to process
300 for identification of a subset thereof for functional
assay.

In the preferred embodiment of the methods of the
present invention, wherein the function sought to be
identified is protein coding, process 300 is used to
identify a subset thereof suitable for experimental
verification by physical and/or bioinformatic approaches.

For example, putative ORFs identified in process
200 can be classified, or binned, bioinformatically into
putative genes. This binning can be based inter alia upon
consideration of the average number of exons/gene in the
species chosen for analysis, upon density of exons that
have been called on the genomic sequence, and other
empirical rules. Thereafter, one or more among the
gene-specific ORFs can be chosen for subsequent use in gene
expression assay.

Where such subsequent gene expression assay uses
amplified nucleic acid, considerations such as desired
amplicon length, primer synthesis requirements, putative
exon length, sequence GC content, existence of possible
secondary structure, and the like can be used to identify
and select those ORFs that appear most likely successfully
to amplify. Where subsequent gene expression assay relies
upon nucleic acid hybridization, considerations such as
hybridization stringency can be applied to identify that
subset of sequences that will most readily permit
sequence-specific discrimination at a chosen hybridization
and wash stringency. One particular such consideration is
avoidance of putative exons that span repetitive sequence;
such sequence can hybridize spuriously to nonspecific
message, reducing specific signal in the hybridization.

For bioinformatic assay, there are fewer
constraints on the sequences that can be tested
experimentally, and in this latter case therefore process
300 can output the entirety of the input sequence.

The subset of sequences identified by process 300
as suitable for use in assay is then used in process 400 to
create the physical and/or informational substrate for
experimental verification of the predictions made in
process 200, and thereafter to assay those substrates.

As mentioned, the methods of the present
invention are particularly useful for identifying potential
coding regions within genomic sequence. In a preferred
embodiment of process 400, therefore, the expression of the
sequences predicted to encode protein is verified. The
combination of the predictive and experimental methods
provides a powerful gene discovery engine.

Thus, in another aspect, the present invention
provides methods and apparatus for verifying the expression
of putative genes identified within genomic sequence. In
particular, the invention provides a novel method of
verifying gene expression in which expression of predicted
ORFs is measured and confirmed using a novel type of
nucleic acid microarray, the genome-derived single exon
nucleic acid microarrays of the present invention.

Putative ORFs as predicted by a consensus of gene
calling, particularly gene prediction, algorithms in

conveniently used, other amplification approaches can also
be used.

Amplification schemes can be designed to capture
the entirety of each predicted ORF in an amplicon with
minimal additional (that is, intronic or intergenic)
sequence. Because ORFs predicted from human genomic
sequence using the methods of the present invention differ
in length, such an approach results in amplicons of varying
length.

However, most predicted ORFs are shorter than 500
bp in length, and although amplicons of at least about 100
or 200 base pairs can be immobilized as probes on nucleic
acid microarrays, early experimental results using the
methods of the present invention have suggested that longer
amplicons, at least about 400 or 500 base pairs, are more
effective. Furthermore, certain advantages derive from
application to the microarray of amplicons of defined size.

Therefore, amplification schemes can
alternatively, and preferably, be designed to amplify
regions of defined size, preferably at least about 300, 400
or 500 bp, centered about each predicted ORF. Such an
approach results in a population of amplicons of limited
size diversity, but that typically contain intronic and/or
intergenic nucleic acid in addition to putative ORF.
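
The selection of a region of defined size centered about a predicted ORF can be sketched as follows. The function name centered_window, the half-open coordinate convention, the clamping at the ends of the sequence, and the decision to return an ORF longer than the target unchanged are assumptions made for this illustration.

```python
def centered_window(orf_start, orf_end, target_len=500, seq_len=None):
    """Return a half-open (start, end) window of about `target_len` bp
    centered on the ORF, clamped to the available sequence."""
    orf_len = orf_end - orf_start
    if orf_len >= target_len:
        return orf_start, orf_end        # ORF already fills the target
    flank = target_len - orf_len         # non-ORF sequence to be added
    start = orf_start - flank // 2
    end = start + target_len
    if start < 0:                        # ran off the left end
        start, end = 0, target_len
    if seq_len is not None and end > seq_len:   # ran off the right end
        end = seq_len
        start = max(0, end - target_len)
    return start, end


window = centered_window(1000, 1200)     # a 200 bp ORF
```

The resulting coordinates would then be handed to a primer design program as the target region.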

Conversely, somewhat fewer than 10% of ORFs
predicted from human genomic sequence according to the
methods of the present invention exceed 500 bp in length.
Portions of such extended ORFs, preferably at least about
300, 400 or 500 bp in length, can be amplified. However, it
has been discovered that the percentage success at
amplifying pieces of such ORFs is low, and that such
putative exons are more effectively amplified when larger
regions, as large as 2000 bp, are amplified.

The putative ORFs selected in process 300 are
thus input into one or more primer design programs, such as
PRIMER 3 (available online for use at
http://www-genome.wi.mit.edu/cgi-bin/primer/), with a goal
of amplifying at least about 500 base pairs of genomic
sequence centered within or about ORFs predicted to be no
more than about 500 bp, or at least about 1000 - 1500 bp of
genomic sequence for ORFs predicted to exceed 500 bp in
length, and the primers synthesized by standard techniques.
Primers with the requisite sequences can be purchased
commercially or synthesized by standard techniques.

Conveniently, a first predetermined sequence can
be added commonly to the ORF-specific 5' primer and a
second, typically different, predetermined sequence
commonly added to each 3' ORF-unique primer. This serves
to immortalize the amplicon, that is, serves to permit
further amplification of any amplicon using a single set of
primers complementary respectively to the common 5' and
common 3' sequence elements. The presence of these
"universal" priming sequences further facilitates later
sequence verification, providing a sequence common to all
amplicons at which to prime sequencing reactions. The
common 5' and 3' sequences further serve to add a cloning
site should any of the ORFs warrant further study.
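
The addition of common priming sequences to ORF-specific primers can be sketched as follows. The two 16 nt tails and the ORF-specific primer sequences shown are arbitrary placeholders, not the sequences used in the Examples, and placing each tail at the 5' end of its primer is an assumption of the sketch.

```python
# Arbitrary 16 nt placeholder tails; NOT the sequences used in the Examples.
UNIVERSAL_5 = "GTAAAACGACGGCCAG"
UNIVERSAL_3 = "CAGGAAACAGCTATGA"


def add_universal_tails(forward_primer, reverse_primer,
                        tail_5=UNIVERSAL_5, tail_3=UNIVERSAL_3):
    """Prepend the common tails to the 5' ends of the ORF-specific primers."""
    return tail_5 + forward_primer, tail_3 + reverse_primer


# Hypothetical ORF-specific primers, written 5' to 3'.
fwd, rev = add_universal_tails("ATGGCCATTGTAATGGGCCG", "TTACGCCAGTTGCTTCAGGA")
```

Every amplicon made with such tailed primers can afterwards be reamplified or sequenced with a single pair of primers matching the two tails.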

Such predetermined sequence is usefully at least
about 10, 12 or 15 nt in length, and usually does not
exceed about 25 nt in length. The "universal" priming
sequences used in the examples presented infra were each 16
nt long.

The genomic DNA to be used as substrate for
amplification will come from the eukaryotic species from
which the genomic sequence data had originally been
obtained, or a closely related species, and can
conveniently be prepared by well known techniques from
somatic or germline tissue or cultured cells of the
organism. See, e.g., Short Protocols in Molecular Biology:
A Compendium of Methods from Current Protocols in
Molecular Biology, Ausubel et al. (eds.), 4th edition
(April 1999), John Wiley & Sons (ISBN: 047132938X) and
Maniatis et al., Molecular Cloning: A Laboratory Manual,
2nd edition (December 1989), Cold Spring Harbor Laboratory
Press (ISBN: 0879693096). Many such prepared genomic DNAs
are available commercially, with the human genomic DNAs
additionally having certification of donor informed
consent.

Although the intronic and intergenic material
flanking putative coding regions in the amplicons could
potentially interfere with hybridizations during microarray
experiments, we have found, surprisingly, that differential
expression ratios are not significantly affected. Rather,
the predominant effect of exon size is to alter the
absolute signal intensity, rather than its ratio. Equally
surprising, the art had suggested that single exon probes
would not provide sufficient signal intensity for high
stringency hybridization analyses; we find that such probes
not only provide adequate signal, but have substantial
advantages, as herein described.

After partial purification, as by size exclusion
spin column, with or without confirmation as to amplicon
quality as by gel electrophoresis, each amplicon (single
exon probe) is disposed in an array upon a support
substrate.

Methods for creating microarrays by deposition
and fixation of nucleic acids onto support substrates are
well known in the art (reviewed by Schena et al., see
above).

Typically, the support substrate will be glass,
although other materials, such as amorphous or crystalline
silicon or plastics, can also be used. Such plastics
include polymethylacrylic, polyethylene, polypropylene,
polyacrylate, polymethylmethacrylate, polyvinylchloride,
polytetrafluoroethylene, polystyrene, polycarbonate,
polyacetal, polysulfone, celluloseacetate,
cellulosenitrate, nitrocellulose, or mixtures thereof.
Typically, the support will be rectangular,
although other shapes, particularly circular disks and even
spheres, present certain advantages. Particularly
advantageous alternatives to glass slides as support
substrates for array of nucleic acids are optical discs, as
described in WO 98/12559.

The amplified nucleic acids can be attached
covalently to a surface of the support substrate or, more
typically, applied to a derivatized surface in a chaotropic
agent that facilitates denaturation and adherence by
presumed noncovalent interactions, or some combination
thereof.

Robotic spotting devices useful for arraying
nucleic acids on support substrates can be constructed
using public domain specifications (The MGuide, version
2.0, http://cmgm.stanford.edu/pbrown/mguide/index.html), or
can conveniently be purchased from commercial sources
(MicroArray GenII Spotter and MicroArray GenIII Spotter,
Molecular Dynamics, Inc., Sunnyvale, CA). Spotting can
also be effected by printing methods, including those using
ink jet technology.

As is well known in the art, microarrays
typically also contain immobilized control nucleic acids.
For controls useful in providing measurements of background
signal for the genome-derived single exon microarrays of
the present invention, a plurality of E. coli genes can
readily be used. As further described in Example 1, 16 or
32 E. coli genes suffice to provide a robust measure of
background noise in such microarrays.
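
One way in which control spots might be used to set a background cutoff is sketched below. The rule shown (mean of the control intensities plus k standard deviations), the value k = 3, and all intensity values are assumptions made for illustration; they are not the procedure or data of Example 1.

```python
from statistics import mean, stdev


def background_threshold(control_intensities, k=3.0):
    """Cutoff = mean of control spots plus k sample standard deviations."""
    return mean(control_intensities) + k * stdev(control_intensities)


def expressed_probes(probe_intensities, control_intensities, k=3.0):
    """Keep probes whose signal exceeds the control-derived cutoff."""
    cutoff = background_threshold(control_intensities, k)
    return {name: value for name, value in probe_intensities.items()
            if value > cutoff}


# Hypothetical intensities for 16 control spots and two single exon probes.
controls = [100, 110, 95, 105, 90, 120, 100, 98,
            102, 97, 103, 108, 92, 99, 101, 104]
probes = {"orf_a": 900.0, "orf_b": 105.0}
expressed = expressed_probes(probes, controls)
```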



As is well known in the art, the amplified
nucleic acids can include native
nucleotides linked by phosphodiester bonds, or
alternatively can include either nonnative nucleotides,
alternative internucleotide linkages, or both, so long as
complementary binding can be obtained in the hybridization.
If enzymatic amplification is used to produce the
immobilized probes, the amplifying enzyme will impose
certain further constraints upon the types of nucleic acid
analogs that can be generated.

Although particularly described herein as using
high density microarrays constructed on planar substrates,
the methods of the present invention for confirming the
expression of ORFs predicted from genomic sequence can use
any of the known types of microarrays, as herein defined,
including lower density planar arrays, and microarrays on
nonplanar, nonunitary, distributed substrates.

For example, gene expression can be confirmed
using hybridization to lower density arrays, such as those
constructed on membranes, such as nitrocellulose, nylon,

Further, gene expression can also be confirmed using
nonplanar, bead-based microarrays such as are described in
Brenner et al., Proc. Natl. Acad. Sci. USA 97(4):1665-1670
(2000); U.S. Patent No. 6,057,107; and U.S. Patent No.
5,736,330. In theory, a packed collection of such beads
provides in aggregate a higher density of nucleic acid
probe than can be achieved with spotting or lithography
techniques on a single planar substrate.

Planar microarrays on solid substrates, however,
provide certain useful advantages, including high
throughput and compatibility with existing readers. For
example, each standard microscope slide can include at
least 1000, typically at least 2000, preferably 5000 and
up to 10,000 - 50,000 or more nucleic acid probes of
discrete sequence. The number of sequences deposited will
depend on their required application.

Each putative gene can be represented in the
array by a single predicted ORF. Alternatively, genes can
be represented by more than one predicted ORF. For
purposes of measuring differential splicing, more than one
predicted ORF will be provided for a putative gene. And as
is well known in the art, each probe of defined sequence,
representing a single predicted ORF, can be deposited in a
plurality of locations on a single microarray to provide
redundancy of signal.

The genome-derived single exon microarrays described above differ in
several fundamental and advantageous ways from microarrays presently
used in the gene expression art, including (1) those created by
deposition of mRNA-derived nucleic acids, (2) those created by in situ
synthesis of oligonucleotide probes, and (3) those constructed from
yeast genomic DNA.

Most nucleic acid microarrays that are in use for study of eukaryotic
gene expression have as immobilized probes nucleic acids that are
derived, either directly or indirectly, from expressed message. As
discussed above, it is common, for example, for such microarrays to be
derived from cDNA/EST libraries, either from those previously
described in the literature, see Lennon et al., or from the de novo
construction of "problem specific" libraries targeted at a particular
biological question, R.S. Thomas et al., Cancer Res. (in press). Such
microarrays are herein collectively denominated "EST microarrays".

Such EST microarrays by definition can measure expression only of
those genes found in EST libraries, shown herein to represent only a
fraction of expressed genes. Furthermore, such libraries, and thus
microarrays based thereupon, are biased by the tissue or cell type of
message origin, by the expression levels of the respective genes
within the tissues, and by the ability of the message successfully to
have been reverse-transcribed and cloned.

Thus, as further discussed in Example 1, the methods of the present
invention enable sequences that do not appear in EST or other
expression databases to be [...] measurements could not, therefore,
have been represented as probes on an EST microarray. And as further
demonstrated in the examples, infra, the remaining population of genes
identified from genomic sequence by the methods of the present
invention (that is, the one third of sequences that had previously
been accessioned in EST or other expression databases) are biased
toward genes with higher expression levels.

Representation of a message in an EST and/or cDNA library depends upon
the successful reverse transcription, optionally but typically with
subsequent successful cloning, of the message. This introduces
substantial bias into the population of probes available for arraying
in EST microarrays.

In contrast, neither reverse transcription nor cloning is required to
produce the probes arrayed on the genome-derived single exon
microarrays of the present invention. And although the ultimate
deposition of a probe on the genome-derived single exon microarray of
the present invention depends upon a successful amplification from
genomic material, a priori knowledge of the sequence of the desired
amplicon affords greater opportunity to recover any given probe
sequence recalcitrant to amplification than is afforded by the
requirement for successful reverse transcription and cloning of
unknown message in EST approaches.

Thus, the genome-derived single exon microarrays of the present
invention present a far greater diversity of probes for measuring gene
expression, with far less bias, than do EST microarrays presently used
in the art.

As a further consequence of their ultimate origin from expressed
message, the probes in EST microarrays often contain poly-A (or
complementary poly-T) stretches derived from the poly-A tail of mature
mRNA. These homopolymeric stretches contribute to cross-hybridization,
that is, to a spurious signal occasioned by hybridization to the
homopolymeric tail of a labeled cDNA that lacks sequence homology to
the gene-specific portion of the probe.

In contrast, the probes arrayed in the genome-derived single exon
microarrays of the present invention lack homopolymeric stretches
derived from message polyadenylation, and thus can provide more
specific signal. Typically, at least about 50, 60 or 75% of the probes
on the genome-derived single exon microarrays of the present invention
lack homopolymeric regions consisting of A or T, where a homopolymeric
region is defined for purposes herein as stretches of 25 or more,
typically 30 or more, identical nucleotides.
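By way of illustration only, the homopolymeric-region criterion just defined can be checked computationally. The following is a minimal sketch, not a description of any procedure actually used herein; the function names and the example probe sequences are hypothetical, and the threshold of 25 identical nucleotides is taken from the definition above.

```python
import re

# Threshold taken from the definition in the text: a homopolymeric region
# is a run of 25 or more identical nucleotides (here restricted to A or T).
MIN_RUN = 25


def has_homopolymer_at(seq: str, min_run: int = MIN_RUN) -> bool:
    """Return True if seq contains a run of at least min_run A's or T's."""
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_run, min_run))
    return pattern.search(seq.upper()) is not None


def fraction_lacking_homopolymer(probes, min_run: int = MIN_RUN) -> float:
    """Fraction of probe sequences that contain no A/T homopolymeric region."""
    probes = list(probes)
    if not probes:
        raise ValueError("no probes supplied")
    clean = sum(1 for p in probes if not has_homopolymer_at(p, min_run))
    return clean / len(probes)


# Hypothetical probe sequences, for illustration only.
example_probes = [
    "ACGT" * 50,                   # no long A or T run
    "ACGTACGT" + "A" * 30 + "GC",  # poly-A stretch of 30
    "GGCC" + "T" * 24 + "GGCC",    # run of 24 T's, below the threshold
    "CCGG" + "T" * 25 + "CCGG",    # run of exactly 25 T's
]

print(fraction_lacking_homopolymer(example_probes))  # prints 0.5
```

Passing min_run=30 applies the more stringent "typically 30 or more" definition instead.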

A further distinction, which also affects the specificity of
hybridization, is occasioned by the typical derivation of EST
microarray probes from cloned material. Because much of the probe
material disposed as probes on EST microarrays is excised or amplified
from plasmid, phage, or phagemid vectors, EST microarrays typically
include a fair amount of vector sequence, more so when the probes are
amplified, rather than excised, from the vector.

In contrast, the vast majority of probes in the genome-derived single
exon microarrays of the present invention contain no prokaryotic or
bacteriophage vector sequence, having been amplified directly or
indirectly from genomic DNA. Typically, therefore, at least about 50,
60, 70 or 80% or more of individual exon-including probes disposed on
a genome-derived single exon microarray of the present invention lack
vector sequence, and particularly lack sequences drawn from plasmids
and [...] Preferably, at least about 85, 90 or more than 90% of
exon-including probes in the genome-derived single exon microarray of
the present invention lack vector sequence. With attention to removal
of vector sequences through preprocessing 24, percentages of
vector-free exon-including probes can be as high as 95 - 98%. The
substantial absence of vector sequence from the genome-derived single
exon microarrays of the present invention results in greater
specificity during hybridization, since spurious cross-hybridization
to a probe vector sequence is reduced.

As a further consequence of excision or amplification of probes from
vectors in EST microarrays, the probes arrayed thereon often contain
artificial sequence, derived from vector polylinker multiple cloning
sites, at both 5' and 3' ends. The probes disposed upon the
genome-derived single exon microarrays need have no such artificial
sequence appended thereto.

As mentioned above, however, the ORF-specific primers used to amplify
the probes can include artificial sequences, typically 5' to the
ORF-specific primer sequence, useful for "universal" (that is,
independent of ORF sequence) priming of subsequent amplification or
sequencing reactions. When such "universal" 5' and/or 3' priming
sequences are appended to the amplification primers, the probes
disposed upon the genome-derived single exon microarray will include
artificial sequence similar to that found in EST microarrays. However,
the genome-derived single exon microarray of the present invention can
be made without such sequences, and if so constructed, presents an
even smaller amount of nonspecific sequence that would contribute to
nonspecific hybridization.

Yet another consequence of typical use of cloned material as probes in
EST microarrays is that such microarrays contain probes that result
from cloning artifacts, such as chimeric molecules containing coding
region of two separate genes. Derived from genomic material, typically
not thereafter cloned, the probes of the genome-derived single exon
microarrays of the present invention lack such cloning artifacts, and
thus provide greater specificity of signal in gene expression
measurements.

A further consequence of the cloned origin of probes on many EST
microarrays is that the individual probes often have disparate sizes,
which can cause the optimal hybridization stringency to vary among
probes on a single microarray. In contrast, as discussed above, the
probes arrayed on the genome-derived single exon microarrays of the
present invention can readily be designed to have a narrow
distribution in sizes, with the range of probe sizes no greater than
about 10% of the average size, typically no greater than about 5% of
the average probe size.
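The size criterion just stated (range of probe sizes relative to average probe size) reduces to simple arithmetic. The sketch below is offered only as one possible way of computing it; the function name and the example probe lengths are hypothetical.

```python
def relative_size_range(lengths):
    """Return (max - min) / mean for a collection of probe lengths in bp."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no probe lengths supplied")
    mean = sum(lengths) / len(lengths)
    return (max(lengths) - min(lengths)) / mean


# Hypothetical probe lengths in base pairs, for illustration only.
probe_lengths = [480, 500, 520]

spread = relative_size_range(probe_lengths)   # (520 - 480) / 500 = 0.08
print(spread <= 0.10)   # True: range is within 10% of the average size
print(spread <= 0.05)   # False: range exceeds 5% of the average size
```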

Because of their origin from fully- or partially-spliced message,
probes disposed upon EST arrays will often include multiple exons. The
percentage of such exon-spanning probes in an EST microarray can be
calculated, on average, based upon the predicted number of exons/gene
for the given species and the average length of the immobilized
probes. For human genes, the near-complete sequence of human
chromosome 22, Dunham et al., Nature 402(6761):489-95 (1999), predicts
that human genes average 5.5 exons/gene. Even with probes of 200 - 500
bp, the vast majority of human EST microarray probes include more than
one exon.

In contrast, by virtue of their origin from algorithmically identified
ORFs in genomic sequence, the probes in the genome-derived single exon
microarrays of the [...] 75, 80, 85, 95 or 98% of probes deposited in
the genome-derived microarray of the present invention consist of, or
include, no more than one predicted ORF.

This provides the ability, not readily achieved using EST microarrays,
to use the genome-derived single exon microarrays of the present
invention to measure tissue-specific expression of individual exons,
which in turn allows differential splicing events to be detected and
characterized, and in particular, allows the correlation of
differential splicing to tissue-specific expression patterns.

Furthermore, the exons that are represented in EST microarrays are
often biased toward the 3' or 5' end of their respective genes, since
sequencing strategies used for EST identification are so biased. In
contrast, no such 3' or 5' bias necessarily inheres in the selection
of exons for disposition on the genome-derived single exon microarrays
of the present invention.

Conversely, the probes provided on the genome-derived single exon
microarrays of the present invention typically include, but need not
necessarily include, intronic and/or intergenic sequence that is
absent from EST microarrays, which are derived from mature mRNA.
Typically, at least about 50, 60, 70, 80 or 90% of the exon-including
probes on the genome-derived single exon microarrays of the present
invention include sequence drawn from noncoding regions. As discussed
above, the additional presence of noncoding region does not
significantly interfere with measurement of gene expression, and
provides the additional opportunity to assay prespliced RNA, and thus
measure such phenomena as nuclear export control.

The genome-derived single exon microarrays of the present invention
are also quite different from in situ synthesis microarrays, where
probe size is severely constrained by inadequacies in the
photolithographic synthesis process.

Typically, probes arrayed on in situ synthesis microarrays are limited
to a maximum of about 25 bp. As a well known consequence,
hybridization to such chips must be performed at low stringency. In
order, therefore, to achieve unambiguous sequence-specific
hybridization results, the in situ synthesis microarray requires
substantial redundancy, with concomitant programmed arraying for each
probe of probe analogues with altered (i.e., mismatched) sequence.

In contrast, the longer probe length of the genome-derived single exon
microarrays of the present invention allows much higher stringency
hybridization and wash. Typically, therefore, exon-including probes on
the genome-derived single exon microarrays of the present invention
average at least about 100, 200, 300, 400 or 500 bp in length. By
obviating the need for substantial probe redundancy, this approach
permits a higher density of probes for discrete exons or genes to be
arrayed on the microarrays of the present invention than can be
achieved for in situ synthesis microarrays.

A further distinction is that the probes in in situ synthesis
microarrays typically are covalently linked to the substrate surface.
In contrast, the probes disposed on the genome-derived microarray of
the present invention typically are, but need not necessarily be,
bound noncovalently to the substrate.

Furthermore, the short probe size on in situ microarrays causes large
percentage differences in the melting temperature of probes hybridized
to their complementary target sequence, and thus causes large
percentage differences in the theoretically optimum stringency across
the array as a whole. In contrast, the larger probe size in the
microarrays of the present invention creates lower percentage
differences in melting temperature across the range of arrayed probes.

A further significant advantage of the microarrays of the present
invention over in situ synthesized arrays is that the quality of each
individual probe can be confirmed before deposition. In contrast, the
quality of probes cannot be assessed on a probe-by-probe basis for the
in situ synthesized microarrays presently being used.

The genome-derived single exon microarrays of the present invention
are also distinguished over, and present substantial benefits over,
the genome-derived microarrays from lower eukaryotes such as yeast.
Lashkari et al., Proc. Natl. Acad. Sci. USA 94:13057-13062 (1997).

Only about 220 - 250 of the 6100 or so nuclear genes in Saccharomyces
cerevisiae (that is, only about 4 - 5%) have standard, spliceosomal,
introns, Lopez et al., Nucl. Acids Res. 28:85-86 (2000); Spingola et
al., RNA 5(2):221-34 (1999). Furthermore, the entire yeast genome has
already been sequenced. These two facts permit the ready amplification
and disposition of single-ORF amplicons on such microarray without the
requirement for antecedent use of gene prediction and/or comparative
sequence analyses.

Thus, a significant aspect of the present invention is the ability to
identify and to confirm expression of predicted coding regions in
genomic sequence drawn from eukaryotic organisms that have a higher
percentage of genes having introns than do yeast such as Saccharomyces
cerevisiae, particularly in genomic sequence drawn from eukaryotes in
which at least about 10, 20 or 50% of protein-encoding genes have
introns. In preferred embodiments, the methods and apparatus of the
present invention are used to identify and confirm expression of novel
genes from genomic sequence of eukaryotes in which the average number
of introns per gene is at least about one, two or three or more.

After the physical substrate is prepared, experimental verification of
predicted function is performed.

In a preferred embodiment of the present invention, where the function
sought to be identified in genomic sequence is protein coding,
experimental verification is performed by measuring expression of the
putative ORFs, typically through nucleic acid hybridization
experiments, and in particularly preferred embodiments, through
hybridization to genome-derived single exon microarrays prepared as
above-described.

Expression is conveniently measured and expressed for each probe in
the microarray as a ratio of the expression measured concurrently in a
plurality of mRNA sources, according to techniques well known in the
microarray art, reviewed in Schena et al., and as further described in
Example 2, below. The mRNA source for the reference against which
specific expression is measured can be drawn from a homogeneous mRNA
source, such as a single cultured cell-type, or alternatively can be
heterogeneous, as from a pool of mRNA derived from multiple tissues
and/or cell types, as further described in Example 2, infra.
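The ratio-based expression measure described above can be sketched as follows. This is an illustrative computation only, under the assumptions that each probe yields one background-corrected signal for the index source and one for the reference source, and that the ratio is reported on a log2 scale; the function name, the floor value used to avoid division by zero, and the example signals are all hypothetical.

```python
import math


def log2_expression_ratios(signals, floor=1.0):
    """Map probe id -> log2(index signal / reference signal).

    signals: dict of probe id -> (index_signal, reference_signal).
    floor:   assumed minimum signal, applied to both channels so that
             zero or negative background-corrected values do not break
             the ratio.
    """
    ratios = {}
    for probe_id, (index_signal, reference_signal) in signals.items():
        index_signal = max(index_signal, floor)
        reference_signal = max(reference_signal, floor)
        ratios[probe_id] = math.log2(index_signal / reference_signal)
    return ratios


# Hypothetical two-source signals for three probes.
example_signals = {
    "probe_1": (800.0, 200.0),  # 4-fold higher in the index source
    "probe_2": (400.0, 400.0),  # equal expression
    "probe_3": (50.0, 400.0),   # 8-fold lower in the index source
}

print(log2_expression_ratios(example_signals))
# {'probe_1': 2.0, 'probe_2': 0.0, 'probe_3': -3.0}
```

A log scale is used here merely so that equal fold changes up and down are symmetric about zero; a plain ratio serves equally well for the purposes described.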

mRNA can be prepared by standard techniques, see Ausubel et al. and
Maniatis et al., or purchased commercially. The mRNA is then typically
reverse-transcribed in the presence of labeled nucleotides: the index
source (that in which expression is desired to be measured) is reverse
transcribed in the presence of nucleotides labeled with a first label,
typically a fluorophore (fluorochrome; fluor; fluorescent dye); the
reference source is reverse transcribed in the presence of a second
label, typically a fluorophore, typically
fluorometrically-distinguishable from the first label. As further
described in Example 2, infra, Cy3 and Cy5 dyes prove particularly
useful in these methods. After partial purification of the index and
reference targets, hybridization to the probe array is conducted
according to standard techniques, typically under a coverslip.

After wash, microarrays are conveniently scanned using a commercial
microarray scanning device, such as a Gen3 Scanner (Molecular
Dynamics, Sunnyvale, CA). Data on expression is then passed, with or
without interim storage, to process 500, where the results for each
probe are related to the original sequence.

Often, hybridization of target material to the genome-derived single
exon microarray will identify certain of the probes thereon as of
particular interest. Thus, it is often desirable that the user be able
readily to obtain sufficient quantities of an individual probe, either
for subsequent arrayed deposition upon an additional support
substrate, often as part of a microarray having a plurality of probes
so identified, or alternatively or additionally as a solitary
solid-phase or solution-phase probe, for further use.

Thus, in another aspect, the present invention provides compositions
and kits for the ready production of nucleic acids identical in
sequence to, or substantially identical in sequence to, probes on the
genome-derived single exon microarrays of the present invention.

In this aspect, a small quantity of each probe is [...]
spatially-addressable ordered set, typically one per well of a
microtiter dish. Although a 96 well microtiter plate can be used,
greater efficiency is obtained using higher density arrays, such as
are provided by microtiter plates having 384, 864, 1536, 3456, 6144,
or 9600 wells; and although microtiter plates having physical
depressions (wells) are conveniently used, any device that permits
addressable withdrawal of reagent from fluidly-noncommunicating areas
can be used.
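One way such a spatially-addressable ordered set might be indexed is sketched below. The sketch assumes, as is conventional for the plate formats listed above, that each plate has a 2:3 ratio of rows to columns (for example 8 x 12 for 96 wells), that probes are placed in row-major order, and that rows are lettered A through Z and then AA, AB, and so on. These conventions and the function names are assumptions made for illustration, not requirements of the invention.

```python
import math


def plate_shape(n_wells):
    """Return (rows, columns) for a plate with a 2:3 row-to-column layout."""
    rows = math.isqrt(2 * n_wells // 3)
    cols = n_wells // rows if rows else 0
    if rows == 0 or rows * cols != n_wells or 2 * cols != 3 * rows:
        raise ValueError(f"{n_wells} wells does not fit a 2:3 layout")
    return rows, cols


def row_label(row):
    """Zero-based row index -> letter label (A..Z, then AA, AB, ...)."""
    label = ""
    row += 1
    while row > 0:
        row, rem = divmod(row - 1, 26)
        label = chr(ord("A") + rem) + label
    return label


def well_for_probe(probe_index, n_wells=96):
    """Zero-based probe index -> well address, filling the plate row by row."""
    rows, cols = plate_shape(n_wells)
    if not 0 <= probe_index < n_wells:
        raise IndexError("probe index outside the plate")
    row, col = divmod(probe_index, cols)
    return f"{row_label(row)}{col + 1}"


print(plate_shape(384))            # (16, 24)
print(well_for_probe(0, 96))       # A1
print(well_for_probe(95, 96))      # H12
print(well_for_probe(1535, 1536))  # AF48
```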
In this aspect of the invention, therefore, a fluidly noncommunicating
addressable ordered set of individual probes, corresponding to those
on a genome-derived single exon microarray, is provided, with each
probe in sufficient quantity to permit amplification, such as by PCR.
As earlier mentioned, the ORF-specific 5' primers used for genomic
amplification can have a first common sequence added thereto, and the
ORF-specific 3' primers used for genomic amplification can have a
second, different, common sequence added thereto, thus permitting, in
this preferred embodiment, the use of a single set of 5' and 3'
primers to amplify any one of the probes from the amplifiable ordered
set.
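The effect of adding common sequences to the ORF-specific primers can be illustrated with a short sketch. The two common ("universal") sequences, the genomic template, the primer length, and the function names below are arbitrary placeholders chosen for the example; they are not sequences disclosed herein. The sketch shows only that every amplicon so produced begins with the first common sequence on one strand and with the second common sequence on the other, so that a single primer pair suffices for reamplification.

```python
# Placeholder common sequences, chosen arbitrarily for illustration.
COMMON_5 = "GTAAAACGACGGCCAGT"
COMMON_3 = "CAGGAAACAGCTATGAC"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def tailed_primers(template, start, end, primer_len=20):
    """Build 5' and 3' primers for template[start:end], each carrying a
    common sequence 5' to its ORF-specific portion."""
    forward = COMMON_5 + template[start:start + primer_len]
    reverse = COMMON_3 + revcomp(template[end - primer_len:end])
    return forward, reverse


def tailed_amplicon(template, start, end):
    """Top strand of the product expected from the tailed primers."""
    return COMMON_5 + template[start:end] + revcomp(COMMON_3)


# Hypothetical 60-nt genomic template.
template = "ACGTTGCAAGGCTTACGGATCCATGCAAGTCGATTGACCGTAAGCTTGGCATCGATCGGA"

amplicon = tailed_amplicon(template, 5, 55)
print(amplicon.startswith(COMMON_5))           # True
print(revcomp(amplicon).startswith(COMMON_3))  # True
```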

Each discrete amplifiable probe can also be packaged with
amplification primers, solutes, buffers, etc., and can be provided in
dry (e.g., lyophilized) form or wet, in the latter case typically with
addition of agents that retard evaporation.

In another aspect of the present invention, a genome-derived
single-exon microarray is packaged together with such an ordered set
of amplifiable probes corresponding to the probes, or one or more
subsets of probes, thereon. In alternative embodiments, the ordered
set of amplifiable probes is packaged separately from the
genome-derived single exon microarray.

In some embodiments, the microarray and/or ordered probe set are
further packaged with recordable media that provide probe
identification and addressing information, and that can additionally
contain annotation information, such as gene expression data. Such
recordable media can be packaged with the microarray, with the ordered
probe set, or with both.

If the microarray is constructed on a substrate that incorporates
recordable media, such as is described in international patent
application no. WO 98/12559, then separate packaging of the
genome-derived single exon microarray and the bioinformatic
information is not required.

The amount of amplifiable probe material should be sufficient to
permit at least one amplification sufficient for subsequent
hybridization assay.

Although the use of high density genome-derived microarrays on solid
planar substrates is presently a preferred approach for the physical
confirmation and characterization of the expression of sequences
predicted to encode protein, other types of microarrays (as herein
defined) can also be used.

Furthermore, as earlier mentioned, experimental verification of the
function predicted from genomic sequence in process 200 can be
bioinformatic, rather than, or additional to, physical verification.

For example, where the function desired to be identified is protein
coding, the predicted ORFs can be compared bioinformatically to
sequences known or suspected of being expressed.

Thus, the sequences output from process 300 (or process 200) can be
used to query expression databases, such as EST databases, SNP
("single nucleotide polymorphism") databases, known cDNA and mRNA
sequences, SAGE ("serial analysis of gene expression") databases, and
more generalized sequence databases that allow query for expressed
sequences. Such query can be done by any sequence query algorithm,
such as BLAST ("basic local alignment search tool"). The results of
such query (including information on identical sequences and
information on nonidentical sequences that have diffuse or focal
regions of sequence homology to the query sequence) can then be passed
directly to process 500, or used to inform analyses subsequently
undertaken in process 200, process 100, or process 400.

Experimental data, whether obtained by physical or bioinformatic assay
in process 400, is passed to process 500 where it is usefully related
to the sequence data itself, a process colloquially termed
"annotation". Such annotation can be done using any technique that
usefully relates the functional information to the sequence, as, for
example, by incorporating the functional data into the record itself,
by linking records in a hierarchical or relational database, by
linking to external databases, or by a combination thereof. Such
database techniques are well within the skill in the art.
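As one hedged illustration of annotation by linking records in a relational database, the following sketch relates a genomic sequence, a predicted ORF within it, and assay results for that ORF. The schema, table and column names, accession, coordinates, and outcomes are all hypothetical and are chosen only to show the linkage; any comparable database technique could be substituted.

```python
import sqlite3


def build_annotation_db():
    """Create an in-memory database with three linked record types."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE genomic_sequence (
            id        INTEGER PRIMARY KEY,
            accession TEXT NOT NULL
        );
        CREATE TABLE predicted_orf (
            id          INTEGER PRIMARY KEY,
            sequence_id INTEGER NOT NULL REFERENCES genomic_sequence(id),
            start_nt    INTEGER NOT NULL,
            end_nt      INTEGER NOT NULL,
            method      TEXT NOT NULL
        );
        CREATE TABLE assay_result (
            id         INTEGER PRIMARY KEY,
            orf_id     INTEGER NOT NULL REFERENCES predicted_orf(id),
            assay_type TEXT NOT NULL,   -- 'physical' or 'bioinformatic'
            outcome    TEXT NOT NULL
        );
        """
    )
    return conn


def annotations_for(conn, accession):
    """Return (start, end, method, assay type, outcome) rows for a sequence."""
    query = """
        SELECT o.start_nt, o.end_nt, o.method, a.assay_type, a.outcome
        FROM genomic_sequence AS s
        JOIN predicted_orf AS o ON o.sequence_id = s.id
        JOIN assay_result  AS a ON a.orf_id = o.id
        WHERE s.accession = ?
        ORDER BY o.start_nt, a.assay_type
    """
    return conn.execute(query, (accession,)).fetchall()


conn = build_annotation_db()
# Hypothetical records, for illustration only.
conn.execute("INSERT INTO genomic_sequence VALUES (1, 'EXAMPLE_BAC_0001')")
conn.execute("INSERT INTO predicted_orf VALUES (1, 1, 1200, 1650, 'method_a')")
conn.execute("INSERT INTO assay_result VALUES (1, 1, 'physical', 'expressed')")
conn.execute("INSERT INTO assay_result VALUES (2, 1, 'bioinformatic', 'no EST match')")

for row in annotations_for(conn, "EXAMPLE_BAC_0001"):
    print(row)
```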

The annotated sequence data can be stored 
locally, uploaded to genomic sequence database 100, and/or 
displayed 800. 

The methods and apparatus of the present invention rapidly produce
functional information from genomic sequence. Coupled with the
escalating pace at which sequence now accumulates, the rapid pace of
sequence annotation produces a need for methods of displaying the
information in meaningful ways.

FIG. 3 shows visual display 80 presenting a single genomic sequence
annotated according to the present invention. Because of its nominal
resemblance to artistic works of Piet Mondrian, visual display 80 is
alternatively described herein as a "Mondrian".

Each of the visual elements of display 80 is aligned with respect to
the genomic sequence being annotated (hereinafter, the "annotated
sequence"). Given the number of nucleotides typically represented in
an annotated sequence, representation of individual nucleotides would
rarely be readable in hard copy output of display 80. Typically,
therefore, the annotated sequence is schematized as rectangle 89,
extending from the left border of display 80 to its right border. By
convention herein, the left border of rectangle 89 represents the
first nucleotide of the sequence and the right border of rectangle 89
represents the last nucleotide of the sequence.

As further discussed below, however, the Mondrian visual display of
annotated sequence can serve as a convenient graphical user interface
for computerized representation, analysis, and query of information
stored electronically. For such use, the individual nucleotides can
conveniently be linked to the X axis coordinate of rectangle 89. This
permits the annotated sequence at any point within rectangle 89
readily to be viewed, either automatically (for example, by
time-delayed appearance of a small overlaid window upon movement of a
cursor or other pointer over rectangle 89) or through user
intervention, as by clicking a mouse or other pointing device at a
point in rectangle 89.
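The linkage between nucleotides and the X axis coordinate of rectangle 89 can be sketched as a simple linear mapping. The sketch assumes, consistent with the convention stated above, that the left border corresponds to the first nucleotide and the right border to the last; the function names, the pixel coordinates, and the 1-based numbering of nucleotides are assumptions made for illustration.

```python
def x_to_nucleotide(x, rect_left, rect_right, seq_len):
    """Map an X coordinate within the rectangle to a 1-based nucleotide
    position, with the left border -> 1 and the right border -> seq_len."""
    if seq_len < 2 or rect_right <= rect_left:
        raise ValueError("need seq_len >= 2 and rect_right > rect_left")
    if not rect_left <= x <= rect_right:
        raise ValueError("x lies outside the rectangle")
    fraction = (x - rect_left) / (rect_right - rect_left)
    return round(fraction * (seq_len - 1)) + 1


def nucleotide_to_x(position, rect_left, rect_right, seq_len):
    """Inverse mapping: 1-based nucleotide position -> X coordinate."""
    if seq_len < 2 or not 1 <= position <= seq_len:
        raise ValueError("position outside the annotated sequence")
    fraction = (position - 1) / (seq_len - 1)
    return rect_left + fraction * (rect_right - rect_left)


# Hypothetical display: rectangle spans x = 0..1000 for a 150,001-nt sequence.
print(x_to_nucleotide(0, 0, 1000, 150001))      # 1
print(x_to_nucleotide(500, 0, 1000, 150001))    # 75001
print(x_to_nucleotide(1000, 0, 1000, 150001))   # 150001
print(nucleotide_to_x(75001, 0, 1000, 150001))  # 500.0
```

In a graphical user interface, a pointer event at coordinate x would be passed through the first function to select the stretch of annotated sequence to show in the overlaid window.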

Visual display 80 is generated after user specification of the genomic
sequence to be displayed. Such specification can consist of or include
an accession number for a single clone (e.g., a single BAC accessioned
into GenBank), wherein the starting and stopping nucleotides are thus
absolutely identified, or alternatively can consist of or include an
anchor or fulcrum point about which a chosen range of sequence is
anchored, thus providing relative endpoints for the sequence to be
displayed. For example, the user can anchor such a range about a given
chromosomal map location, gene name, or even a sequence returned by
query for similarity or identity to an input query sequence. When
visual display 80 is used as a graphical user interface to
computerized data, additional control over the first and last
displayed nucleotide will typically be dynamically [...] tools.

Field 81 of visual display 80 is used to present the output from
process 200, that is, to present the bioinformatic prediction of those
sequences having the desired function within the genomic sequence.
Functional sequences are typically indicated by at least one rectangle
83 (83a, 83b, 83c), the left and right borders of which respectively
indicate, by their X-axis coordinates, the starting and ending
nucleotides of the region predicted to have function.

Where a single bioinformatic method or approach identifies a plurality
of regions having the desired function, a plurality of rectangles 83
is disposed horizontally in field 81. Where multiple methods and/or
approaches are used to identify function, each such method and/or
approach can be represented by its own series of horizontally disposed
rectangles 83, each such horizontally disposed series of rectangles
offset vertically from those representing the results of the other
methods and approaches.

Thus, rectangles 83a in FIG. 3 represent the functional predictions of
a first method of a first approach for predicting function, rectangles
83b represent the functional predictions of a second method and/or
second approach for predicting that function, and rectangles 83c
represent the predictions of a third method and/or approach.
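The arrangement of prediction rectangles in field 81 can be sketched as a simple layout computation: each method receives its own horizontal row, and each predicted region becomes a rectangle whose left edge and width follow from its starting and ending nucleotides. The function name, the display width, the row height and gap, and the example predictions are hypothetical values chosen for illustration.

```python
def layout_tracks(predictions, seq_len, display_width=1000.0,
                  row_height=10.0, row_gap=4.0):
    """Return a list of (method, x, y, width, height) rectangles.

    predictions: dict of method name -> list of (start, end) pairs, using
                 1-based inclusive nucleotide positions.
    Each method is drawn in its own row, offset vertically from the others.
    """
    rectangles = []
    for row, (method, regions) in enumerate(predictions.items()):
        y = row * (row_height + row_gap)
        for start, end in regions:
            x = (start - 1) * display_width / seq_len
            width = (end - start + 1) * display_width / seq_len
            rectangles.append((method, x, y, width, row_height))
    return rectangles


# Hypothetical predictions from two methods on a 100,000-nt sequence.
example_predictions = {
    "method_a": [(1001, 2000)],
    "method_b": [(1501, 3000)],
}

for rect in layout_tracks(example_predictions, seq_len=100_000):
    print(rect)
# ('method_a', 10.0, 0.0, 10.0, 10.0)
# ('method_b', 15.0, 14.0, 15.0, 10.0)
```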

Where the function desired to be identified is protein coding, field
81 is used to present the bioinformatic prediction of sequences
encoding protein. For example, rectangles 83a can represent the
results from GRAIL or GRAIL II, rectangles 83b can represent the
results from GENEFINDER, and rectangles 83c can represent the [...]



WO 01/57274 PCT/US01/00666 

Optionally, and preferably, rectangles 83 collectively representing
predictions of a single method and/or approach are identically colored
and/or textured, and are distinguishable from the color and/or texture
used for a different method and/or approach.

Alternatively, or in addition, the color, hue, density, or texture of
rectangles 83 can be used further to report a measure of the
bioinformatic reliability of the prediction. For example, many gene
prediction programs will report a measure of the reliability of
prediction. Thus, increasing degrees of such reliability can be
indicated, e.g., by increasing density of shading. Where display 80 is
used as a graphical user interface, such measures of reliability, and
indeed all other results output by the program, can additionally or
alternatively be made accessible through linkage from individual
rectangles 83, as by time-delayed window ("tool tip" window), or by
pointer (e.g., mouse)-activated link.

As earlier described, increased predictive [...] methods and/or
approaches to determining function. Thus, field 81 can include a
horizontal series of rectangles 83 that indicate one or more degrees
of consensus in predictions of function.

Although FIG. 3 shows three series of horizontally disposed rectangles
in field 81, display 80 can include as few as one such series of
rectangles and as [...] number of methods and/or approaches used to
predict a given function.

Furthermore, field 81 can be used to show predictions of a plurality
of different functions. However, the increased visual complexity
occasioned by such display makes more useful the ability of the user
to select [...] such function can usefully be indicated and
user-selectable, as by a series of graphical buttons or tabs (not
shown in FIG. 3).

Rectangle 89 is shown in FIG. 3 as including interposed rectangle 84.
Rectangle 84 represents the portion of annotated sequence for which
predicted functional information has been assayed physically, with the
starting and ending nucleotides of the assayed material indicated by
the X axis coordinates of the left and right borders of rectangle 84.
Rectangle 85, with optional inclusive circles 86 (86a, 86b, and 86c),
displays the results of such physical assay.

Although a single rectangle 84 is shown in FIG. 3, physical assay is
not limited to just one region of annotated genomic sequence. It is
expected that an increasing percentage of regions predicted to have
function by process 200 will be assayed physically, and that display
80 will accordingly, for any given genomic sequence, have an
increasing number of rectangles 84 and 85, representing an increased
density of sequence annotation.

Where the function desired to be identified is
protein coding, rectangle 84 identifies the sequence of the
probe used to measure expression. In embodiments of the
present invention where expression is measured using
genome-derived single exon microarrays, rectangle 84
identifies the sequence included within the probe
immobilized on the support surface of the microarray. As
noted supra, such probe will often include a small amount
of additional, synthetic, material incorporated during
amplification and designed to permit reamplification of the
probe, which sequence is typically not shown in display 80.

Rectangle 87 is used to present the results of
bioinformatic assay of the genomic sequence. For example,
where the function desired to be identified is protein
coding, process 400 can include bioinformatic query of
expression databases with the sequences predicted in
process 200 to encode exons. And as earlier discussed,
because bioinformatic assay presents fewer constraints than
does physical assay, often the entire output of process 200
can be used for such assay, without further subsetting
thereof by process 300. Therefore, rectangle 87 typically
need not have separate indicators therein of regions
submitted for bioinformatic assay; that is, rectangle 87
typically need not have regions therein analogous to
rectangles 84 within rectangle 89.

Rectangle 87 as shown in FIG. 3 includes smaller
rectangles 880 and 88. Rectangles 880 indicate regions
that returned a positive result in the bioinformatic assay,
with rectangles 88 representing regions that did not return
such positive results. Where the function desired to be
predicted and displayed is protein coding, rectangles 880
indicate regions of the predicted exons that identify
sequence with significant similarity in expression
databases, such as EST, SNP, SAGE databases, with
rectangles 88 indicating genes novel over those identified
in existing expression databases.

Rectangles 880 can further indicate, through
color, shading, texture, or the like, additional
information obtained from bioinformatic assay.

For example, where the function assayed and
displayed is protein coding, the degree of shading of
rectangles 880 can be used to represent the degree of
sequence similarity found upon query of expression
databases. The number of levels of discrimination can be
as few as two (identity, and similarity, where similarity
has a user-selectable lower threshold). Alternatively, as
many different levels of discrimination can be indicated as
can visually be discriminated.
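
The mapping from degree of sequence similarity to a discrete
shading level can be sketched as follows. This is a minimal
illustration only: the function name shade_level, the
representation of similarity as a fraction between 0 and 1, and
the default threshold and number of levels are assumptions made
for the example and are not taken from the specification.

```python
def shade_level(similarity, threshold=0.5, levels=5):
    """Map a sequence-similarity fraction to a discrete shading level.

    Hypothetical helper (not from the specification).
    Returns 0 when the similarity falls below the user-selectable
    lower threshold (no database hit is displayed), `levels` for
    exact identity, and 1..levels-1 for intermediate similarity.
    With levels=2 only "similarity" (1) and "identity" (2) remain.
    """
    if levels < 2:
        raise ValueError("at least two levels are required")
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie between 0 and 1")
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must lie in [0, 1)")
    if similarity < threshold:
        return 0
    if similarity == 1.0:
        return levels
    # Divide [threshold, 1.0) evenly into levels-1 bins numbered 1..levels-1.
    fraction = (similarity - threshold) / (1.0 - threshold)
    step = int(fraction * (levels - 1))
    return 1 + min(step, levels - 2)


if __name__ == "__main__":
    for s in (0.40, 0.50, 0.75, 0.99, 1.00):
        print(s, shade_level(s))
```

Setting levels to 2 reproduces the two-level case described
above; larger values give progressively finer shading, up to
whatever number of shades can visually be discriminated.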
Where display 80 is used as a graphical user
interface, rectangles 880 can additionally provide links
directly to the sequences identified by the query of
expression databases, and/or statistical summaries thereof.
As with each of the precedingly-discussed uses of display
80 as a graphical user interface, it should be understood
that the information accessed via display 80 need not be
resident on the computer presenting such display, which
often will be serving as a client, with the linked
information resident on one or more remotely located
servers.

Rectangle 85 displays the results of physical
assay of the sequence delimited by its left and right
borders.

Rectangle 85 can consist of a single rectangle,
thus indicating a single assay, or alternatively, and
increasingly typically, will consist of a series of
rectangles (85a, 85b, 85c) indicating separate physical
assays of the same sequence.

Where the function assayed is gene expression,
and where gene expression is assayed as herein described
using simultaneous two-color fluorescent detection of
hybridization to genome-derived single exon microarrays,
individual rectangles 85 can be colored to indicate the
degree of expression relative to control. Conveniently,
shades of green can be used to depict expression in the
sample over control values, and shades of red used to
depict expression less than control, corresponding to the
spectra of the Cy3 and Cy5 dyes conventionally used for
respective labeling thereof. Additional functional
information can be provided in the form of circles 86 (86a,
86b, 86c), where the diameter of the circle can be used to
indicate expression intensity. As discussed infra, such
relative expression (expression ratios) and absolute
expression (signal intensity) can be expressed using
normalized values.
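
The color and diameter conventions just described can be
sketched as follows. This is an illustrative sketch only: the
function name expression_glyph, the use of a base-2 logarithm of
the sample/control ratio clipped to three doublings, the RGB
encoding, and the use of the sample-channel signal as the
"intensity" are all assumptions made for the example and are not
prescribed by the specification.

```python
import math


def expression_glyph(sample, control, max_intensity=65535.0,
                     max_diameter=20.0):
    """Return an (R, G, B) color and a circle diameter for one assay.

    Hypothetical helper (not from the specification).
    Green shades indicate expression in the sample over control,
    red shades indicate expression less than control, and black
    indicates no difference. The circle diameter scales with the
    (assumed already normalized) sample signal intensity.
    """
    if sample <= 0 or control <= 0:
        raise ValueError("signals must be positive")
    log_ratio = math.log2(sample / control)
    # Clip to +/- 3 doublings so that extreme ratios saturate the color.
    clipped = max(-3.0, min(3.0, log_ratio))
    strength = int(round(abs(clipped) / 3.0 * 255))
    if clipped > 0:
        color = (0, strength, 0)      # over control: green
    elif clipped < 0:
        color = (strength, 0, 0)      # under control: red
    else:
        color = (0, 0, 0)             # equal to control
    intensity = min(sample, max_intensity)
    diameter = max_diameter * intensity / max_intensity
    return color, diameter


if __name__ == "__main__":
    print(expression_glyph(8.0, 1.0, max_intensity=16.0))
    print(expression_glyph(1.0, 8.0, max_intensity=16.0))
```

A logarithmic scale is chosen here so that equal fold increases
and decreases receive equally strong shading; a linear scale
would serve equally well for the purposes of the display.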

Where display 80 is used as a graphical user
interface, rectangle 85 can be used as a link to further
information about the assay. For example, where the assay
is one for gene expression, each rectangle 85 can be used
to link to information about the source of the hybridized
mRNA, the identity of the control, raw or processed data
from the microarray scan, or the like.

FIG. 4 is a rendition of display 80 representing
gene prediction and gene expression for a hypothetical BAC,
showing conventions used in the Examples presented infra.
BAC sequence ("Chip seq.") 89 is presented, with the
physically assayed region thereof (corresponding to
rectangle 84 in FIG. 3) shown in white. Algorithmic gene
predictions are shown in field 81, with predictions by
GRAIL shown, predictions by GENEFINDER, and predictions by
DICTION shown. Within rectangle 87, regions of sequence
that, when used to query expression databases, return
identical or similar sequences ("EST hit") are shown as
white rectangles (corresponding to rectangles 880 in FIG.
3), gray indicates low homology, and black indicates
unknowns (where black and gray would correspond to
rectangles 88 in FIG. 3).

Although FIGS. 3 and 4 show a single stretch of
sequence, uninterrupted from left to right, longer
sequences are usefully represented by vertical stacking of
such individual Mondrians, as shown in FIGS. 3 and 10.

Single Exon Probes Useful For Measuring Gene Expression

The methods and apparatus of the present
invention rapidly produce functional information from
genomic sequence. Where the function to be identified is
protein coding, the methods and apparatus of the present
invention rapidly identify and confirm the expression of
portions of genomic sequence that function to encode
protein. As a direct result, the methods and apparatus of
the present invention rapidly yield large numbers of
single-exon nucleic acid probes, the majority from
previously unknown genes, each of which is useful for
measuring and/or surveying expression of a specific gene in
one or more tissues or cell types.

It is, therefore, another aspect of the present
invention to provide genome-derived single exon nucleic
acid probes useful for gene expression analysis, and
particularly for gene expression analysis by microarray.

Using the methods and genome-derived single-exon
microarrays of the present invention, we have for example
readily identified a large number of unique ORFs from human
genomic sequence. Using single exon probes that encompass
these ORFs, we have demonstrated, through microarray
hybridization analysis, the expression of 9,980 of these
ORFs in heart.

As would immediately be appreciated by one of
skill in the art, each single exon probe having
demonstrable expression in heart is currently available for
use in measuring the level of its ORF's expression in
heart.

Diseases of the heart and vascular system are a
significant cause of human morbidity and mortality.
Increasingly, genetic factors are being found that
contribute to predisposition, onset, and/or aggressiveness
of most, if not all, of these diseases. Although mutations
in single genes have on occasion been identified as
causative, these disorders are [...] to have polygenic
etiologies.

For example, cardiovascular disease (CVD), which
includes coronary heart disease, stroke, and peripheral
arterial vascular disease, is the leading cause of death in
the United States and other developed countries. [...]
mortality. In the United States alone, about 1 million
deaths (about 42% of total deaths per year) result from CVD
each year. CVD is also a significant cause of morbidity,
with about 1.5 million people suffering myocardial
infarction, and about 500,000 suffering strokes in the
United States each year. With risk for CVD increasing with
age, and an increasingly aging population, CVD will
continue to be a major health problem into the future.



CVD is caused by arterial lesions that begin as
fatty streaks, which consist of lipid-laden foam cells, and
develop into fibrous plaques. The atherosclerotic plaque
may grow slowly, and over several decades may produce a
severe stenosis or result in arterial occlusion. Some
plaques are stable, but other, more unstable, ones may
rupture and induce thrombosis. The thrombi may embolize,
rapidly occluding the lumen and leading to myocardial
infarction or acute ischemic syndrome.

Risk factors for CVD include age and gender. In
addition, a family history of CVD significantly increases
risk, indicating a genetic basis for development of this
disease complex. Obesity, especially truncal obesity, the
cause of which is suspected to be genetic, is yet another
risk factor for CVD. Familial disorders such as
hyperlipidemia, hypoalphalipoproteinemia,
hypertriglyceridemia, hypercholesterolemia,
hyperinsulinemia, homocystinuria, and
dysbetalipoproteinemia, all of which lead to lipid or
lipoprotein abnormalities, can predispose one to the
development of CVD. Both insulin-dependent and
non-insulin-dependent diabetes mellitus, both of which have
genetic components, have also been linked to the
development of atherosclerosis.

The literature is replete with evidence for
genetic causes of cardiovascular diseases. For example,
studies by Allayee et al., Am. J. Hum. Genet. 63: 577-585
(1998) indicated a genetic association between familial
combined hyperlipidemia (FCHL) and small dense LDL
particles. The studies also concluded that the genetic
determinants for LDL particle size are shared, at least in
part, among FCHL families and the more general population
at risk for CVD. Juo et al., Am. J. Hum. Genet. 63: 586-594
(1998) demonstrated that small, dense LDL particles and
elevated apolipoprotein B levels, both of which are
[...] in members of FCHL families, share a common
major gene plus individual polygenic components.

The common major gene was estimated to explain 37% of
the variance of adjusted LDL particle size and 23% of the
variance of adjusted apoB levels.

The atherogenic lipoprotein phenotype (ALP) is a
common heritable trait, symptoms of which include a
prevalence of small, dense LDL particles, increased levels
of triglyceride-rich lipoproteins, reduced levels of high
density lipoprotein, and increased risk of CVD,
particularly myocardial infarction. Both Nishina et al.,
Proc. Nat. Acad. Sci. 89: 708-712 (1992) and Rotter et al.,
Am. J. Hum. Genet. 58: 585-594 (1996) demonstrated linkage
between ALP and the LDLR locus. Rotter et al., supra, also
reported linkage to the CETP locus on chromosome 16 and to
the SOD2 locus on chromosome 6, and possibly also to the
APOA1/APOC3/APOA4 cluster on chromosome 11.

Mutations in genes identified as components of
lipid metabolism, e.g., [...]
receptor (LDLR), have been shown to be associated with
predisposition to the development of CVD. For example,
several apoE variants had been found to be associated with
familial dysbetalipoproteinemia, characterized by elevated
plasma cholesterol and triglyceride levels and an increased
risk for atherosclerosis (de Knijff et al., Hum. Mutat. 4:
178-194 (1994)). Mutations in the LDLR gene have been
[...] autosomal dominant disorder characterized by
elevation of serum cholesterol bound to low density
lipoprotein (LDL), that can lead to increased
susceptibility to CVD.

To date, mutations in numerous genes have been
shown to be associated with increased CVD susceptibility.
However, the identified genetic associations are believed
not to account for all genetic contributions to CVD.

As yet another example, hypertension is a major
health problem because of its high prevalence and its
association with increased risk of CVD. Approximately 25%
of all adults and over 60% of persons older than 60 years
in the United States have high blood pressure.

Arterial or systemic hypertension is diagnosed
when the average of two or more diastolic BP measurements
on at least two subsequent visits is 90 mm Hg or more, or
when the average of multiple systolic BP readings on two or
more subsequent visits is consistently greater than 140 mm
Hg. Pulmonary hypertension is defined as pressure within
the pulmonary arterial system elevated above the normal
range; pulmonary hypertension may lead to right ventricle
(RV) failure.

Hypertension, together with other cardiovascular
risk factors, leads to atherosclerosis and other forms of
CVD, primarily by damaging the vascular endothelium. In
more than 40% of the U.S. population, hypertension is
accompanied by hyperlipidemia and leads to the development
of atherosclerotic plaques. In the absence of
hyperlipidemia, intimal thickening occurs.
Non-atherosclerotic hypertension-induced vascular damage can
lead to stroke or heart failure.

Familial diseases associated with secondary
hypertension include familial renal disease, polycystic
kidney disease, medullary thyroid cancer, pheochromocytoma
and hyperparathyroidism. Hypertension is also twice as
common in patients with diabetes mellitus.

More than 95% of all hypertension cases are
essential hypertension, that is, they lack an identifiable
antecedent clinical cause. Essential hypertension shows
clustering in families and can result from a variety of
genetic diseases. In most cases, high blood pressure
results from a complex interaction of factors with both
genetic and environmental components. The recent search
for genes that contribute to the development of essential
hypertension has shown that the disorder is polygenic in
origin. However, with several exceptions (such as
angiotensinogen, angiotensin receptor-1, beta-3 subunit of
guanine nucleotide-binding protein, tumor necrosis factor
receptor-2, and alpha-adducin), the particular genes
involved are still being sought.

Susceptibility loci for essential hypertension
have been mapped to chromosomes 17 and 15q. Hasstedt et
al., Am. J. Hum. Genet. 43: 14-22 (1988) measured red cell
sodium in 1,800 normotensive members of 16 Utah pedigrees
ascertained through hypertensive or normotensive probands,
siblings with early stroke death, or brothers with early
coronary disease, and suggested that red blood cell sodium
was determined by 4 alleles at a single locus. This major
locus was thought to explain 29% of the variance in red
cell sodium, and polygenic inheritance explained another
54.6%. A higher frequency of the high red blood cell sodium
genotype in pedigrees in which the proband was hypertensive
rather than normotensive provided evidence that this major
locus increases susceptibility to hypertension.

From a study of systolic blood pressure in 278
pedigrees, Perusse et al., Am. J. Hum. Genet. 49: 94-105
(1991) reported that variability in systolic blood pressure
is likely influenced by allelic variation of a single gene,
with gender and age dependence. They also suggested that a
single gene may be associated with a steeper increase of
blood pressure with age among males and females.

There is strong evidence, however, for additional,
as yet uncharacterized, hypertension-associated loci on
other chromosomes.

For example, Xu et al., Am. J. Hum. Genet. 64:
1694-1701 (1999) carried out a systematic search for
chromosomal regions containing genes that regulate blood
pressure by scanning the entire autosomal genome using 367
polymorphic markers. Because of the sampling design, the
number of sib pairs, and the availability of genotyped
parents, this study represented one of the most powerful of
its kind. Although no regions achieved a 5% genomewide
significance level, maximum lod scores were greater than
2.0 for regions of chromosomes 3, 11, 15, 16, and 17.

As another example, cardiac arrhythmias account
for several thousand deaths each year. Arrhythmias such as
ventricular fibrillation, which causes more than 300,000
sudden deaths annually in the United States alone,
encompass a multitude of disorders. Another type of
arrhythmia, idiopathic dilated cardiomyopathy, of which
familial dilated cardiomyopathy accounts for 20-25%, is
responsible for more than 10,000 deaths in the United
States annually and is the predominant indication for
cardiac transplantation.

Cardiac arrhythmias can be divided into
bradyarrhythmias (slowed rhythms) or tachyarrhythmias
(speeded rhythms). Bradyarrhythmias result from
abnormalities of intrinsic automatic behavior or
[...]
the His-Purkinje's network. Tachyarrhythmias are caused by
altered automaticity, reentry, or triggered automaticity.

Bradyarrhythmias arising from suspected polygenic
disorders include Long QT syndrome 4, atrioventricular
block, familial sinus node disease, progressive cardiac
conduction defect, and familial cardiomyopathy.
Tachyarrhythmias with possible underlying polygenic causes
include familial ventricular tachycardia,
Wolff-Parkinson-White syndrome, familial arrhythmogenic
right ventricular dysplasia, heart-hand syndrome V, Mal de
Meleda, familial ventricular fibrillation, and familial
noncompaction of left ventricular myocardium.

For some of the arrhythmias, one or more of the 
causative genes have been identified. 

For example, atrioventricular block has been
associated with mutations in the SCN5A gene, as well as
mutations in a locus mapped to 19q13. Studies have shown
linkage of familial sinus node disease to a marker on
10q22-q24. Familial ventricular tachycardia has been
linked to mutations in genes encoding the G protein subunit
alpha-i2 (GNAI2), and/or related genes. Examination of
families with Wolff-Parkinson-White syndrome suggests an
autosomal dominant pattern of inheritance and evidence of
linkage of the disorder to DNA markers on band 7q3.
Linkage analysis shows strong evidence for localization of
a gene for Mal de Meleda disease on 8qter. Familial
ventricular fibrillation can be caused by mutations in the
cardiac sodium channel gene SCN5A. Familial noncompaction
of left ventricular myocardium has been linked to mutations
in the gene encoding tafazzin (TAZ), or in the
FK506-binding protein 1A gene (FKBP1A).

Familial dilated cardiomyopathy is characterized
by an autosomal dominant pattern of inheritance with
age-related penetrance. The linkage of familial dilated
cardiomyopathy to several loci indicates that it is
polygenic. These loci include CMD1A on 1p11-q11, CMD1B on
9q13, CMD1C on 10q21, CMD1D on 1q32, CMD1E on 3p, CMD1F on
6q, CMD1G on 2q31, CMD1H on 2q14-q22, and CMD1I, which
results from mutation in the DES gene on 2q35.
In addition, cardiomyopathy can also be caused by
mutations in the ACTC gene, the cardiac beta-myosin heavy
chain gene (MYH7), or the cardiac troponin T gene.

Familial arrhythmogenic right ventricular
dysplasia is inherited as an autosomal dominant with
reduced penetrance and is one of the major genetic causes
of juvenile sudden death. It is estimated that the
prevalence of familial arrhythmogenic right ventricular
dysplasia ranges from 6 per 10,000 in the general
population to 4.4 per 1,000 in some areas.

Several loci for familial arrhythmogenic right
ventricular dysplasia have been mapped, indicating that this
disease is also polygenic in nature. These loci include
ARVD1 on 14q23-q24, ARVD2 on 1q42-q43, ARVD3 on 14q12-q22,
ARVD4 on 2q32.1-q32.3, ARVD5 on 3p23, and ARVD6 on
10p14-p12.

Progressive cardiac conduction defect (PCCD),
also called Lenegre-Lev disease, is one of the most common
cardiac conduction diseases. It is characterized by
progressive alteration of cardiac conduction through the
His-Purkinje system with right or left bundle branch block
and widening of QRS complexes, leading to complete
atrioventricular block and ultimately causing syncope and
sudden death. It represents the major cause of pacemaker
implantation in the world (0.15 implantations per 1,000
inhabitants per year in developed countries). The cause of
PCCD is unknown, but familial cases with right bundle branch
block have been reported, suggesting that at least some
cases are of genetic origin. Reports have linked PCCD to
HB1 on 19q13.3, and to mutations in the SCN5A gene (Schott
et al., Nature Genet. 23: 20-21 (1999)).

As yet a further example, congenital heart
disease occurs at a rate of 8 per 1000 live births, which
corresponds to approximately 32,000 infants with newly
diagnosed congenital heart disease each year in the United
States. Twenty percent of infants with congenital heart
disease die within the first year of life. Approximately
30% of the first-year survivors live to reach adulthood.
Congenital heart disease also has economic impact due to
the estimated 20,000 surgical procedures performed to
correct circulatory defects in these patients. The
estimated number of adults with congenital heart disease in
the United States is currently about 900,000.

In 80% of patients, congenital heart disease is
attributable to multifactorial inheritance. Only 5-10% of
malformations are due to primary genetic factors, which are
either chromosomal or a result of a single mutant gene.

The most common congenital heart disease found in
adults is bicuspid aortic valve. This defect occurs in 2%
of the general population and accounts for approximately
50% of operated cases of aortic stenosis in adults. Atrial
septal defect is responsible for 30-40% of congenital heart
disease seen in adults. The most common congenital cardiac
defect observed in the pediatric population is ventricular
septal defect, which accounts for 15-20% of all congenital
lesions. Tetralogy of Fallot is the most common cyanotic
congenital anomaly observed in adults. Other congenital
heart diseases include Eisenmenger's syndrome, patent
ductus arteriosus, pulmonary stenosis, coarctation of the
aorta, transposition of the great arteries, tricuspid
atresia, univentricular heart, Ebstein's anomaly, and
double-outlet right ventricle.

A number of studies have identified putative
genetic loci associated with one or more congenital heart
diseases.

Congenital heart disease affects more than 40% of
all Down syndrome patients. The candidate chromosomal
region containing the putative gene or genes for congenital
heart disease associated with Down syndrome is
21q22.2-q22.3, between ETS2 and MX1.

DiGeorge syndrome (DGS) is characterized by
several symptoms including outflow tract defects of the
heart such as tetralogy of Fallot. Most cases result from
a deletion of chromosome 22q11.2 (the DiGeorge syndrome
chromosome region, or DGCR). The 22q11 deletion is the
second most common cause of congenital heart disease after
Down syndrome. Several genes are lost in this deletion
including the putative transcription factor TUPLE1. This
deletion is associated with a variety of phenotypes, e.g.,
Shprintzen syndrome; conotruncal anomaly face (or Takao
syndrome); and isolated outflow tract defects of the heart
including Tetralogy of Fallot, truncus arteriosus, and
interrupted aortic arch.

Whereas 90% of cases of DGS may now be attributed
to a 22q11 deletion, other associated chromosome defects
have been identified. For example, Greenberg et al., Am.
J. Hum. Genet. 43: 605-611 (1988), reported 1 case of DGS
with del10p13 and one with a 18q21.33 deletion. Fukushima
et al., Am. J. Hum. Genet. 51 (suppl.): A80 (1992) reported
linkage with a deletion of 4q21.3-q25. Gottlieb et al.,
Am. J. Hum. Genet. 62: 495-498 (1998) concluded that the
deletion of more than 1 region on 10p could be associated
with the DGS phenotype. The association of the DiGeorge
syndrome with at least 2 and possibly more chromosomal
locations suggests strongly the involvement of several
genes in this disease.

Digilio et al., J. Med. Genet. 34: 188-190
(1997), calculated empiric risk figures for recurrence of
isolated Tetralogy of Fallot in families after exclusion of
del(22q11), and concluded that gene(s) different from those
located on 22q11 must be involved in causing familial
aggregation of nonsyndromic Tetralogy of Fallot. Johnson
et al., Am. J. Med. Genet. (1997) conducted a cytogenetic
evaluation of 159 cases of Tetralogy of Fallot. They
reported that a del(22q11) was identified in 14% who
underwent fluorescence in situ hybridization (FISH) testing
with the N25 cosmid probe.



[...] congenital heart disease are suspected
to be of polygenic origin. For example, Holmes et al.,
Birth Defects Orig. Art. Ser. X(4): 223-230 (1974)
described familial clustering of hypoplastic left heart
syndrome in siblings consistent with multifactorial
causation.



Other significant diseases of the heart and
vascular system are also believed to have a genetic,
typically polygenic, etiological component. These diseases
include, for example, hypoplastic left heart syndrome,
cardiac valvular dysplasia, Pfeiffer cardiocranial
syndrome, oculofaciocardiodental syndrome, Kapur-Toriello
syndrome, Sonoda syndrome, Ohdo blepharophimosis syndrome,
heart-hand syndrome, Pierre-Robin syndrome, Hirschsprung
disease, Kousseff syndrome, Grange occlusive arterial
syndrome, Kearns-Sayre syndrome, Kartagener syndrome,
Alagille syndrome, Ritscher-Schinzel syndrome, Ivemark
syndrome, Young-Simpson syndrome, hemochromatosis,
Holzgreve syndrome, Barth syndrome, Smith-Lemli-Opitz
syndrome, glycogen storage disease, Gaucher-like disease,
Fabry disease, Lowry-Maclean syndrome, Rett syndrome, Opitz
syndrome, Marfan syndrome, Miller-Dieker lissencephaly
syndrome, mucopolysaccharidosis, Brugada syndrome,
humerospinal dysostosis, Phaver syndrome, McDonough
syndrome, Marfanoid hypermobility syndrome,
atransferrinemia, Cornelia de Lange syndrome, Leopard
syndrome, Diamond-Blackfan anemia, Steinfeld syndrome,
progeria, and Williams-Beuren syndrome.



The genome-derived single exon
probes and microarrays of the present invention are useful
for predicting, diagnosing, grading, staging, monitoring
and prognosing diseases of human heart and vascular system,
particularly those diseases with polygenic etiology. With
each of the single exon probes described herein shown to be
expressed at detectable levels in human heart, and with
about 2/3 of the probes identifying novel genes, the single
exon microarrays of the present invention provide
exceptionally high informational content for such studies.
For example, diagnosis (including differential
diagnosis among clinically indistinguishable disorders),
staging, and/or grading of a disease can be based upon the
quantitative relatedness of a patient gene expression
profile to one or more reference expression profiles known
to be characteristic of a given heart or vascular disease,
or to specific grades or stages thereof.

In one embodiment, the patient gene expression
profile is generated by hybridizing nucleic acids obtained
directly or indirectly from transcripts expressed in the
patient's heart or vascular tissues to the genome-derived
single exon microarray of the present invention. Reference
profiles are obtained similarly by hybridizing nucleic
acids obtained directly or indirectly from transcripts
expressed in heart or vascular tissue of individuals with
known disease. Methods for quantitatively relating gene
expression profiles, without regard to the function of the
protein encoded by the gene, are disclosed in WO 99/58720,
incorporated herein by reference in its entirety.
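
One simple way to relate a patient profile quantitatively to a
set of reference profiles is sketched below. This is a minimal
illustration under stated assumptions and is not the method of
WO 99/58720, the details of which are not reproduced here: the
Pearson correlation coefficient is used as a stand-in measure of
relatedness, each profile is assumed to be a list of normalized
expression values in the same probe order, and the names pearson
and closest_reference and the example data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def closest_reference(patient, references):
    """Return the name of the most correlated reference and all scores.

    `references` maps a (hypothetical) disease label to a reference
    expression profile measured on the same probes as `patient`.
    """
    scores = {name: pearson(patient, profile)
              for name, profile in references.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Made-up values for illustration only; not measured data.
    patient = [1.0, 2.0, 3.0, 4.0]
    references = {
        "disease_A": [2.0, 4.0, 6.0, 8.0],
        "disease_B": [4.0, 3.0, 2.0, 1.0],
    }
    print(closest_reference(patient, references))
```

Because the comparison uses only the expression values
themselves, it proceeds without regard to the function of the
protein encoded by any probed gene, consistent with the approach
described above.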

In another approach, the genome-derived single
exon probes and microarrays of the present invention can be
used to interrogate genomic DNA, rather than pools of
expressed message; this latter approach permits
predisposition to and/or prognosis of heart or vascular
disease to be assessed through the massively parallel
determination of altered copy number, deletion, or mutation
in the patient's genome of exons known to be expressed in
human heart. The algorithms set forth in WO 99/58720 can
be applied to such genomic profiles without regard to the
function of the protein encoded by the interrogated gene.
The utility is specific to the probe; at
sufficiently high hybridization stringency, which
stringencies are well known in the art - see Ausubel et al.
and Maniatis et al. - each probe reports the level of
expression of message specifically containing that ORF.
It should be appreciated, however, that the
probes of the present invention, for which expression in
the heart has been demonstrated, are useful for both
measurement in the heart and for survey of expression in
other tissues.

Significant among such advantages is the presence
of probes for novel genes.

As mentioned above and further detailed in
Examples 1 and 2, the methods described enable ORFs which
are not present in existing expression databases to be
identified. And the fewer the number of tissues in which
the ORF can be shown to be expressed, the more likely the
ORF will prove to be part of a novel gene: as further
discussed in Example 2, ORFs whose expression was
measurable in only a single one of the tested tissues were
represented in existing expression databases at a rate of
only 11%, whereas 36% of ORFs whose expression was
measurable in 9 tissues were present in existing expression
databases, and fully 45% of those ORFs expressed in all ten
tested tissues were present in existing expressed sequence
databases.

Either as tools for measuring gene expression or
tools for surveying gene expression, the genome-derived
single exon probes of the present invention have
significant advantages over the cDNA or EST-based probes
that are currently available for achieving these utilities.

The genome-derived single exon probes of the 
present invention are useful in constructing genome-derived 
single exon microarrays; the genome-derived single exon 
microarrays, in turn, are useful devices for measuring and 

35 for surveying gene expression in the human. 

67 



WO 01/57274 PCT/US01/00666 

Gene expression analysis using microarrays,
conventionally using microarrays having probes derived from
expressed message, is well-established as useful in the
biological research arts (see Lockhart et al., Nature 405,
827-836).

Microarrays have been used to determine gene
expression profiles in cells in response to drug treatment
(see, for example, Kaminski et al., "Global Analysis of
Gene Expression in Pulmonary Fibrosis Reveals Distinct
Programs Regulating Lung Inflammation and Fibrosis," Proc.
Natl. Acad. Sci. USA 97(4): 1778-83 (2000); Bartosiewicz et
al., "Development of a Toxicological Gene Array and
Quantitative Assessment of This Technology," Arch. Biochem.
Biophys. 376(1): 66-73 (2000)), viral infection (see, for
example, Geiss et al., "Large-scale Monitoring of Host Cell
Gene Expression During HIV-1 Infection Using cDNA
Microarrays," Virology 266(1): 8-16 (2000)) and during cell
processes such as differentiation, senescence and apoptosis
(see, for example, Shelton et al., "Microarray Analysis of
Replicative Senescence," Curr. Biol. 9(17): 939-45 (1999);
Voehringer et al., "Gene Microarray Identification of Redox
and Mitochondrial Elements That Control Resistance or
Sensitivity to Apoptosis," Proc. Natl. Acad. Sci. USA
97(6): 2680-5 (2000)).
Microarrays have also been used to determine
abnormal gene expression in diseased tissues (see, for
example, Alon et al., "Broad Patterns of Gene Expression
Revealed by Clustering Analysis of Tumor and Normal Colon
Tissues Probed by Oligonucleotide Arrays," Proc. Natl.
Acad. Sci. USA 96(12): 6745-50 (1999); Perou et al.,
"Distinctive Gene Expression Patterns in Human Mammary
Epithelial Cells and Breast Cancers," Proc. Natl. Acad. Sci.
USA 96(16): 9212-7 (1999); Wang et al., "Identification of
Genes Differentially Over-expressed in Lung Squamous Cell
Carcinoma Using Combination of cDNA Subtraction and
Microarray Analysis," Oncogene 19(12): 1519-28 (2000);
Whitney et al., "Analysis of Gene Expression in Multiple
Sclerosis Lesions Using cDNA Microarrays," Ann. Neurol.
46(3): 425-8 (1999)), in drug discovery screens (see, for
example, Scherf et al., "A Gene Expression Database for the
Molecular Pharmacology of Cancer," Nat. Genet. 24(3): 236-44
(2000)) and in diagnosis to determine appropriate treatment
strategies (see, for example, Sgroi et al., "In vivo Gene
Expression Profile Analysis of Human Breast Cancer
Progression," Cancer Res. 59(22): 5656-61 (1999)).

In microarray-based gene expression screens of
pharmacological drug candidates upon cells, each probe
provides specific useful data. In particular, it should be
appreciated that even those probes that show no change in
expression are as informative as those that do change,
serving, in essence, as negative controls.

For example, where gene expression analysis is
used to assess toxicity of chemical agents on cells, the
failure of the agent to change a gene's expression level is
evidence that the drug likely does not affect the pathway
of which the gene's expressed protein is a part.
Analogously, where gene expression analysis is used to
assess side effects of pharmacological agents, whether in
lead compound discovery or in subsequent screening of lead
compound derivatives, the inability of the agent to alter
a gene's expression level is evidence that the drug does
not affect the pathway of which the gene's expressed
protein is a part.

WO 99/58720 provides methods for quantifying the
relatedness of a first and second gene expression profile
and for ordering the relatedness of a plurality of gene
expression profiles. The methods so described permit
useful information to be extracted from a greater
percentage of the individual gene expression measurements
from a microarray than methods previously used in the art.

Other uses of microarrays are described in
Gerhold et al., Trends Biochem. Sci. 24(5): 168-173 (1999)
and Zweiger, Trends Biotechnol. 17(11): 429-436 (1999);
Schena et al.

The invention particularly provides genome-
derived single-exon probes known to be expressed in heart.

The individual single exon probes can be provided
in the form of substantially isolated and purified nucleic
acid, typically, but not necessarily, in a quantity
sufficient to perform a hybridization reaction.

Such nucleic acid can be in any form directly
hybridizable to the message that contains the probe's ORF,
such as double-stranded DNA, single-stranded DNA
complementary to the message, single-stranded RNA
complementary to the message, or chimeric DNA/RNA molecules
so hybridizable. The nucleic acid can alternatively or
additionally include either nonnative nucleotides,
alternative internucleotide linkages, or both, so long as
complementary binding can be obtained. For example, probes
can include phosphorothioates, methylphosphonates,
morpholino analogs, and peptide nucleic acids (PNA), as are
described, for example, in U.S. Patent Nos. 5,142,047;
5,235,033; 5,166,315; 5,217,866; 5,184,444; 5,861,250.

Usefully, however, such probes are provided in a
form and quantity suitable for amplification, where the
amplified product is thereafter to be used in the
hybridization reactions that probe gene expression.
Typically, such probes are provided in a form and quantity
suitable for amplification by PCR or by another well known
amplification technique. One such technique additional to
PCR is rolling circle amplification, as is described, inter
alia, in U.S. Patent Nos. 5,854,033 and 5,714,320 and
international patent publications WO 97/19193 and
WO 00/15779. As is well understood, where the probes are
to be provided in a form suitable for amplification, the
range of nucleic acid analogues and/or internucleotide
linkages will be constrained by the requirements and nature
of the amplification enzyme.

Where the probe is to be provided in a form
suitable for amplification, the quantity need not be
sufficient for direct hybridization for gene expression
analysis, and need be sufficient only to function as an
amplification template, typically at least about 1, 10 or
100 pg or more.

Each discrete amplifiable probe can also be
packaged with amplification primers, either in a single
composition that comprises probe template and primers, or
in a kit that comprises such primers separately packaged
therefrom. As earlier mentioned, the ORF-specific
5' primers used for genomic amplification can have a first
common sequence added thereto, and the ORF-specific 3'
primers used for genomic amplification can have a second,
different, common sequence added thereto, thus permitting,
in this embodiment, the use of a single set of 5' and 3'
primers to amplify any one of the probes. The probe
composition and/or kit can also include buffers, enzyme,
etc., required to effect amplification.

As mentioned earlier, when intended for use on a
genome-derived single exon microarray of the present
invention, the genome-derived single exon probes of the
present invention will typically average at least about
100, 200, 300, 400 or 500 bp in length, including (and
typically, but not necessarily, centered about) the ORF.
Furthermore, when intended for use on a genome-derived
single exon microarray of the present invention, the
genome-derived single exon probes of the present invention
will typically not contain a detectable label.

When intended for use in solution phase
hybridization, however (that is, for use in a
hybridization reaction in which the probe is not first
bound to a support substrate, although the target may
indeed be so bound), the length constraints that are imposed
in microarray-based hybridization approaches will be
relaxed, and such probes will typically be labeled.

In such case, the only functional constraint that
dictates the minimum size of such probe is that each such
probe must be capable of specifically identifying in a
hybridization reaction the exon from which it is drawn. In
theory, a probe of as little as 17 nucleotides is capable
of uniquely identifying its cognate sequence in the human
genome. For hybridization to expressed message (a subset
of target sequence that is much reduced in complexity as
compared to genomic sequence), even fewer nucleotides are
required for specificity.
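
By way of illustration only, the 17-nucleotide figure can be
checked with a short calculation. The sketch below assumes a
random-sequence model, a haploid genome of about 3 x 10^9 bp,
and counting of both strands; the function name and the 30 Mb
message-pool size are hypothetical values chosen for the
example and do not come from this description. Real genomic
sequence contains repeats, so the result is a lower bound.

```python
def min_unique_probe_length(target_bases: int, strands: int = 2) -> int:
    """Smallest k for which 4**k exceeds the number of k-mer start sites.

    Assumes random, equiprobable bases (an idealization): below this k,
    a given k-mer is expected to occur more than once by chance alone.
    """
    positions = target_bases * strands
    k = 1
    while 4 ** k <= positions:
        k += 1
    return k


if __name__ == "__main__":
    HUMAN_GENOME_BASES = 3_000_000_000   # assumed approximate haploid size
    MESSAGE_POOL_BASES = 30_000_000      # hypothetical mRNA pool complexity

    print("genome:", min_unique_probe_length(HUMAN_GENOME_BASES))
    print("message:", min_unique_probe_length(MESSAGE_POOL_BASES, strands=1))
```

Under these assumptions the genomic case reproduces the
17-nucleotide figure, and the lower-complexity message pool
needs fewer nucleotides, in line with the paragraph above.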

Therefore, the probes of the present invention
can include as few as 20, 25 or 50 bp of ORF, or more. In
particular embodiments, the ORF sequences are given in SEQ
ID NOS. 9,981 - 19,771, respectively, for probe SEQ ID NOS.
1 - 9,980. The minimum amount of ORF required to be
included in the probe of the present invention in order to
provide specific signal in either solution phase or
microarray-based hybridizations can readily be determined
for each of ORF SEQ ID NOS. 9,981 - 19,771 individually by
routine experimentation using standard high stringency
conditions.

Such high stringency conditions are described,
inter alia, in Ausubel et al. and Maniatis et al. For
microarray-based hybridization, standard high stringency
conditions can usefully be 50% formamide, 5X SSC, 0.2 µg/µl
poly(dA), 0.2 µg/µl human Cot1 DNA, and 0.5% SDS, in a
humid oven at 42°C overnight, followed by successive washes
of the microarray in 1X SSC, 0.2% SDS at 55°C for 5
minutes, and then 0.1X SSC, 0.2% SDS, at 55°C for 20
minutes. For solution phase hybridization, standard high
stringency conditions can usefully be aqueous hybridization
at 65°C in 6X SSC. Lower stringency conditions, suitable
for cross-hybridization to mRNA encoding structurally- and
functionally-related proteins, can usefully be the same as
the high stringency conditions but with reduction in
temperature for hybridization and washing to room
temperature (approximately 25°C).

When intended for use in solution phase
hybridization, the maximum size of the single exon probes
of the present invention is dictated by the proximity of
other expressed exons in genomic DNA: although each single
exon probe can include intergenic and/or intronic material
contiguous to the ORF in the human genome, each probe of
the present invention will include portions of only one
expressed exon.

Thus, each single exon probe will include no more
than about 25 kb of contiguous genomic sequence, more
typically no more than about 20 kb of contiguous genomic
sequence, more usually no more than about 15 kb, even more
usually no more than about 10 kb. Usually, probes that are
maximally about 5 kb will be used, more typically no more
than about 3 kb,

It will be appreciated that the Sequence Listing
appended hereto presents, by convention, only that strand
of the probe and ORF sequence that can be directly
translated reading from 5' to 3' end. As would be well
understood by one of skill in the art, single stranded
probes must be complementary in sequence to the ORF as
present in an mRNA; it is well within the skill in the art
to determine such complementary sequence. It will further
be understood that double stranded probes can be used in
both solution-phase hybridization and microarray-based
hybridization if suitably denatured.

Thus, it is an aspect of the present invention to
provide single-stranded nucleic acid probes that have
sequence complementary to those described herein above and
below, and double-stranded probes one strand of which has
sequence complementary to the probes described herein.

The probes can, but need not, contain intergenic
and/or intronic material that flanks the ORF, on one or
both sides, in the same linear relationship to the ORF that
the intergenic and/or intronic material bears to the ORF in
genomic DNA. The probes do not, however, contain nucleic
acid derived from more than one expressed ORF.

And when intended for use in solution
hybridization, the probes of the present invention can
usefully have detectable labels. Nucleic acid labels are
well known in the art, and include, inter alia, radioactive
labels, such as 3H, 32P, 33P, 35S, 125I, 131I; fluorescent
labels, such as Cy3, Cy5, Cy5.5, Cy7, SYBR®
Green and other labels described in Haugland,
Handbook of Fluorescent Probes and Research Chemicals, 7th
ed., Molecular Probes Inc., Eugene, OR (2000), or
fluorescence resonance energy transfer tandem conjugates
thereof; labels suitable for chemiluminescent and/or
enhanced chemiluminescent detection; labels suitable for
ESR and NMR detection; and labels that include one member
of a specific binding pair, such as biotin, digoxigenin, or
the like.

The probes, either in quantity sufficient for
hybridization or sufficient for amplification, can be
provided in individual vials or containers.

Alternatively, such probes can usefully be
packaged as a plurality of such individual genome-derived
single exon probes.

When provided as a collection of plural
individual probes, the probes are typically made available
in amplifiable form in a spatially-addressable ordered set,
typically one per well of a microtiter dish. Although a 96
well microtiter plate can be used, greater efficiency is
obtained using higher density arrays.

If, as earlier mentioned, the ORF-specific
5' primers used for genomic amplification had a first
common sequence added thereto, and the ORF-specific 3'
primers used for genomic amplification had a second,
different, common sequence added thereto, a single set of
5' and 3' primers can be used to amplify all of the probes
from the amplifiable ordered set.

Such collections of genome-derived single exon
probes can usefully include a plurality of probes chosen
for the common attribute of expression in the human heart.

In such defined subsets, typically at least 50,
60, 75, 80, 85, 90 or 95% or more of the probes will be
chosen by their expression in the defined tissue or cell
type.

The single exon probes of the present invention,
as well as fragments of the single exon probes comprising
selectively hybridizable portions of the probe ORF, can be
used to obtain the full length cDNA that includes the ORF
by (i) screening of cDNA libraries; (ii) rapid
amplification of cDNA ends ("RACE"); or (iii) other
conventional means, as are described, inter alia, in
Ausubel et al. and Maniatis et al.

It is another aspect of the present invention to
provide genome-derived single exon nucleic acid microarrays
useful for gene expression analysis, where the term
"microarray" has the meaning given in the definitional
section of this description, supra.

The invention particularly provides genome-
derived single-exon nucleic acid microarrays comprising a
plurality of probes known to be expressed in human heart.
In preferred embodiments, the present invention provides
human genome-derived single exon microarrays comprising a
plurality of probes drawn from the group consisting of SEQ
ID NOS.: 1 - 9,980.

When used for gene expression analysis, the
genome-derived single exon microarrays provide greater
physical informational density than do the genome-derived
single exon microarrays that have lower percentages of
probes known to be expressed commonly in the tested tissue.
At a fixed probe density, for example, a given microarray
surface area of the defined subset genome-derived single
exon microarray can yield a greater number of expression
measurements. Alternatively, at a given probe density, the
same number of expression measurements can be obtained from
a smaller substrate surface area. Alternatively, at a
fixed probe density and fixed surface area, probes can be
provided redundantly, providing greater reliability in
signal measurement for any given probe. Furthermore, with
a higher percentage of probes known to be expressed in the
assayed tissue, the dynamic range of the detection means
can be adjusted to reveal finer levels of discrimination
among the levels of expression.

Although particularly described with respect to
their utility as probes of gene expression, particularly as
probes to be included on a genome-derived single exon
microarray, each of the nucleic acids having SEQ ID NOS.: 1
- 9,980 contains an open-reading frame, set forth
respectively in SEQ ID NOS.: 9,981 - 19,771, that encodes a
protein domain. Thus, each of SEQ ID NOS. 1 - 9,980 can be
used, or that portion thereof in SEQ ID NOS. 9,981 - 19,771
used, to express a protein domain by standard in vitro
recombinant techniques. See Ausubel et al. and Maniatis et
al.

Additionally, kits are available commercially
that readily permit such nucleic acids to be expressed as
protein in bacterial cells, insect cells, or mammalian
cells, as desired (e.g., HAT™ Protein Expression &
Purification System, ClonTech Laboratories, Palo Alto, CA;
Adeno-X™ Expression System, ClonTech Laboratories, Palo
Alto, CA; Protein Fusion & Purification (pMAL™) System, New
England Biolabs, Beverly, MA).

Furthermore, the peptide can be chemically
synthesized using commercial peptide synthesizing equipment
and well known techniques. Procedures are described, inter
alia, in Chan et al. (eds.), Fmoc Solid Phase Peptide
Synthesis: A Practical Approach (Practical Approach Series,
(Paper)), Oxford Univ. Press (March 2000) (ISBN:
0199637245); Jones, Amino Acid and Peptide Synthesis
(Oxford Chemistry Primers, No 7), Oxford Univ. Press
(August 1992) (ISBN: 0198556683); and Bodanszky, Principles
of Peptide Synthesis (Springer Laboratory), Springer Verlag
(December 1993) (ISBN: 0387564314).

It is, therefore, another aspect of the invention
to provide peptides comprising an amino acid sequence
translated from SEQ ID NOS.: 9,981 - 19,771. Such amino
acid sequences are set out in SEQ ID NOS.: 19,772 - 29,119.
Any such recombinantly-expressed or synthesized peptide of
at least 8, and preferably at least about 15, amino acids,
can be conjugated to a carrier protein and used to generate
antibody that recognizes the peptide. Thus, it is a

have at least 8, preferably at least 15, consecutive amino
acids.

The following examples are offered by way of
illustration and not by way of limitation.

EXAMPLE 1

Preparation of Single Exon Microarrays from ORFs Predicted
in Human Genomic Sequence

Bioinformatics Results

All human BAC sequences in fewer than 10 pieces
that had been accessioned in a five month period
immediately preceding this study were downloaded
(~2200 clones, totaling ~350 MB, or approximately 10% of
one human genome). After masking of repetitive elements
using the program CROSS_MATCH, the sequence was analyzed
for open reading frames by three separate gene finding
programs.

The three programs predict genes using independent
algorithmic methods developed on independent training sets:
GRAIL uses a neural network, GENEFINDER uses a hidden
Markov model, and DICTION, a program proprietary to
Genetics Institute, operates according to a different
heuristic. The results of all three programs were used to
create a prediction matrix across the segment of genomic
DNA.

The three gene finding programs yielded a range 
of results. GRAIL identified the greatest percentage of 
genomic sequence as putative coding region, 2% of the data 
analyzed. GENEFINDER was second, calling 1%, and DICTION 
yielded the least putative coding region, with 0.8% of 
genomic sequence called as coding region. 

The consensus data were as follows. GRAIL and 
GENEFINDER agreed on 0.7% of genomic sequence, GRAIL and 
DICTION agreed on 0.5% of genomic sequence, and the three 
programs together agreed on 0.25% of the data analyzed. 
That is, 0.25% of the genomic sequence was identified by 
all three of the programs as containing putative coding
region.

ORFs predicted by any two of the three programs
("consensus ORFs") were assorted into "gene bins" using two
criteria: (i) any 7 consecutive exons within a 25 kb window
were placed together in a bin as likely contributing to a
single gene, and (ii) all ORFs within a 25 kb window were
placed together in a bin as likely contributing to a single
gene if fewer than 7 exons were found within the 25 kb
window.
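
The two binning criteria can be sketched in code as follows.
This is only one possible reading of the rule, offered as an
illustration: it assumes a greedy left-to-right pass in which
a bin is closed once it holds 7 ORFs or once the next ORF
would extend past 25 kb from the start of the bin. The
function names and the coordinate tuples are hypothetical and
are not taken from the software actually used.

```python
from typing import List, Tuple

ORF = Tuple[int, int]  # (start, end) in bp on one genomic clone; hypothetical layout


def bin_orfs(orfs: List[ORF], window: int = 25_000, max_exons: int = 7) -> List[List[ORF]]:
    """Greedily group position-sorted consensus ORFs into gene bins."""
    bins: List[List[ORF]] = []
    for orf in sorted(orfs):
        current = bins[-1] if bins else None
        fits = (
            current is not None
            and len(current) < max_exons              # at most 7 exons per bin
            and orf[1] - current[0][0] <= window      # still inside the 25 kb window
        )
        if fits:
            current.append(orf)
        else:
            bins.append([orf])                        # start a new gene bin
    return bins


def largest_orf(gene_bin: List[ORF]) -> ORF:
    """Return the longest ORF in a bin, the candidate chosen for amplification."""
    return max(gene_bin, key=lambda o: o[1] - o[0])


if __name__ == "__main__":
    consensus = [(100, 300), (5_000, 5_900), (20_000, 20_150), (40_000, 40_400)]
    for gene_bin in bin_orfs(consensus):
        print(gene_bin, "->", largest_orf(gene_bin))
```

The sketch does not check for repetitive sequence; in the
procedure described here, an ORF spanning repetitive sequence
would be excluded before the largest ORF is chosen.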



The largest ORF from each gene bin that did not
span repetitive sequence was then chosen for amplification,
as were all consensus ORFs longer than 500 bp. This method
approximated one exon per gene; however, a number of genes
were found to be represented by multiple elements.

Previously, we had determined that DNA fragments
fewer than 250 bp in length do not bind well to the
amino-modified glass surface of the slides used as support
substrate for construction of microarrays; therefore,
amplicons were designed in the present experiments to
approximate 500 bp in length.

Accordingly, after selecting the largest ORF per
gene bin, a 500 bp fragment of sequence centered on the ORF
was passed to the primer picking software, PRIMER3
(available online for use at
http://www-genome.wi.mit.edu/cgi-bin/primer/). A first
additional sequence was commonly added to each ORF-unique
5' primer, and a second, different, additional sequence was
commonly added to each ORF-unique 3' primer, to permit
subsequent reamplification of the amplicon using a single
set of "universal" 5' and 3' primers, thus immortalizing
the amplicon. The addition of universal priming sequences
also facilitates sequence verification, and can be used to
add a cloning site should some ORFs be found to warrant
further study.
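
As an illustration of the windowing step, the sketch below
computes a 500 bp window centered on an ORF and the fraction
of that window occupied by the ORF. The function names and
the coordinate convention (0-based, end-exclusive) are
assumptions made for the example; the actual primer positions
are chosen by the primer picking software within or around
such a window.

```python
from typing import Tuple


def amplicon_window(orf_start: int, orf_end: int, target_len: int = 500) -> Tuple[int, int]:
    """Return (start, end) of a target_len window centered on the ORF midpoint."""
    center = (orf_start + orf_end) // 2
    start = max(0, center - target_len // 2)   # clamp at the start of the clone
    return start, start + target_len


def coding_fraction(orf_start: int, orf_end: int, window: Tuple[int, int]) -> float:
    """Fraction of the window that overlaps the predicted ORF."""
    overlap = max(0, min(orf_end, window[1]) - max(orf_start, window[0]))
    return overlap / (window[1] - window[0])


if __name__ == "__main__":
    orf = (1_000, 1_150)                       # a hypothetical 150 bp ORF
    win = amplicon_window(*orf)
    print(win, coding_fraction(*orf, win))
```

For a short ORF the remainder of the window is flanking
intronic or intergenic sequence; for an ORF longer than the
window the whole amplicon is predicted coding sequence.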

The ORFs were then PCR amplified from genomic 
DNA, verified on agarose gels, and sequenced using the 
universal primers to validate the identity of the amplicon 

Primers were supplied by Operon Technologies
(Alameda, CA). PCR amplification was performed by standard
techniques using human genomic DNA (Clontech, Palo Alto,
CA) as template. Each PCR product was verified by SYBR®
green (Molecular Probes, Inc., Eugene, OR) staining of
agarose gels, with subsequent imaging by FluorImager
(Molecular Dynamics, Inc., Sunnyvale, CA). PCR
amplification was classified as successful if a single band
appeared.

The success rate for amplifying ORFs of interest
directly from genomic DNA using PCR was approximately 75%.
FIG. 5 graphs the distribution of predicted ORF (exon)
length and distribution of amplified PCR products, with ORF
length shown in red and PCR product length shown in blue
(which may appear black in the figure). Although the range
of ORF sizes is readily seen to extend to beyond 900 bp,
the mean predicted exon size was only 229 bp, with a median
size of 150 bp (n=9498). With an average amplicon size of
475 ± 25 bp, approximately 50% of the average PCR
amplification product contained predicted coding region,
with the remaining 50% of the amplicon containing either
intron, intergenic sequence, or both.

Using a strategy predicated on amplifying about
500 bp, it was found that long exons had a higher PCR
failure rate. To address this, the bioinformatics process
was adjusted to amplify 1000, 1500 or 2000 bp fragments
from exons larger than 500 bp. This improved the rate of
successful amplification of exons exceeding 500 bp,
constituting about 9.2% of the exons predicted by the gene
finding algorithms.

Approximately 75% of the probes disposed on the
array (90% of those that successfully PCR amplified) were
sequence-verified by sequencing in both the forward and
reverse direction using MegaBACE sequencer (Molecular
Dynamics, Inc., Sunnyvale, CA) and standard protocols.

Some genomic clones (BACs) yielded very poor PCR
and sequencing results. The reasons for this are unclear,
but may be related to the quality of early draft sequence
or the inclusion of vector and host contamination in some
submitted sequence data.

Although the intronic and intergenic material
flanking coding regions could theoretically interfere with
hybridization during microarray experiments, subsequent
empirical results demonstrated that differential expression
ratios were not significantly affected by the presence of
noncoding sequence. The variation in exon size was
similarly found not to affect differential expression
ratios significantly; however, variation in exon size was
observed to affect the absolute signal intensity (data not
shown).

The 350 MB of genomic DNA was, by the above-
described process, reduced to 9750 discrete probes, which
were spotted in duplicate onto glass slides using
commercially available instrumentation (MicroArray GenII
Spotter and/or MicroArray GenIII Spotter, Molecular
Dynamics, Inc., Sunnyvale, CA). Each slide additionally
included either 16 or 32 E. coli genes, the average
hybridization signal of which was used as a measure of
background biological noise.

Each of the probe sequences was BLASTed against 
the human EST data set, the NR data set, and SwissProt 
GenBank (May 7, 1999 release 2.0.9). 

One third of the probe sequences (as amplified)
produced an exact match (BLAST Expect ("E") values less
than 1e-100) to either an EST (20% of sequences) or a known
mRNA (13% of sequences). A further 22% of the probe
sequences showed some homology to a known EST or mRNA
(BLAST E values from 1e-5 to 1e-99). The remaining 45% of
the probe sequences showed no significant sequence homology
to any expressed, or potentially expressed, sequences
present in public databases.

All of the probe sequences (as amplified) were
then analyzed for protein similarities with the SwissProt
database using BLASTX, Gish et al., Nature Genet. 3:266
(1993). The predicted functional breakdowns of the 2/3 of
probes with similarity to known sequences are presented in
Table 1.

Table 1

Function of Predicted ORFs As Deduced From Comparative
Sequence Analysis

Total   V6 chip       Ml chip       Function Predicted from
                                    Comparative Sequence
                                    Analysis

211     96            115           Receptor
120     43            77            Zinc Finger
30      11            19            Homeobox
25      0             [illegible]   Transcription Factor
17      11            7             Transcription
118     57            61            Structural
95      39            56            Kinase
36      18            18            Phosphatase
83      31            52            Ribosomal
45      19            26            Transport
21      [illegible]   14            Growth Factor
17      12            5             Cytochrome
50      33            17            Channel

As can be seen, the two most common types of
genes were transcription factors and receptors, making up
2.2% and 1.8% of the arrayed elements, respectively.

EXAMPLE 2

Gene Expression Measurements From Genome-Derived Single
Exon Microarrays

Microarrays prepared according to Example 1 were hybridized
in a series of experiments to (1) Cy3-labeled cDNA
synthesized from message drawn individually from each of
brain, heart, liver, fetal liver, placenta, lung, bone
marrow, HeLa, BT 474, or HBL 100 cells, and (2) Cy5-labeled
cDNA prepared from message pooled from all ten tissues and
cell types, as a control in each of the measurements.
Hybridization and scanning were carried out using standard
protocols and Molecular Dynamics equipment.

Briefly, mRNA samples were bought from commercial
sources (Clontech, Palo Alto, CA and Amersham Pharmacia
Biotech (APB)). Cy3-dCTP and Cy5-dCTP (both from APB) were
incorporated during separate reverse transcriptions of ~1 µg
of polyA+ mRNA performed using 1 µg oligo(dT)12-18 primer
and 2 µg random 9mer primers as follows. After heating to
70°C, the RNA:primer mixture was snap cooled on ice. After
snap cooling on ice, the following were added to the RNA to
the stated final concentrations: 1X Superscript II buffer,
0.01 M DTT, 100 µM dATP, 100 µM dGTP, 100 µM dTTP, 50 µM
dCTP, 50 µM Cy3-dCTP or Cy5-dCTP, and 200 U Superscript II
enzyme. The reaction was incubated for 2 hours at 42°C.
After 2 hours, the first strand cDNA was isolated by adding
1 U Ribonuclease H, and incubating for 30 minutes at 37°C.
The reaction was then purified using a Qiagen PCR cleanup
column, increasing the number of ethanol washes to 5.
Probe was eluted using 10 mM Tris pH 8.5.

Using a spectrophotometer, probes were measured
for dye incorporation. Volumes of both Cy3 and Cy5 cDNA
corresponding to 50 pmoles of each dye were then dried in a
Speedvac, resuspended in 30 µl hybridization solution
containing 50% formamide, 5X SSC, 0.2 µg/µl poly(dA), 0.2
µg/µl human Cot1 DNA, and 0.5% SDS.

Hybridizations were carried out under a
coverslip, with the array placed in a humid oven at 42°C
overnight. Before scanning, slides were washed in 1X SSC,
0.2% SDS at 55°C for 5 minutes, followed by 0.1X SSC, 0.2%
SDS, at 55°C for 20 minutes. Slides were briefly dipped in
water and dried thoroughly under a gentle stream of
nitrogen.

Slides were scanned using a Molecular Dynamics
Gen3 scanner, as described. Schena (ed.), Microarray
Biochip: Tools and Technology, Eaton Publishing
Company/BioTechniques Books Division (2000) (ISBN:
1881299376).

Although the use of pooled cDNA as a reference
permitted the survey of a large number of tissues, it
attenuates the measurement of relative gene expression,
since every highly expressed gene in the tissue/cell
type-specific fluorescence channel will be present to a
level of at least 10% in the control channel. Because of
this fact, both signal and expression ratios (the latter
hereinafter, "expression" or "relative expression") for each
probe were normalized using the average ratio or average
signal, respectively, as measured across the whole slide.

Data were accepted for further analysis only when
signal was at least three times greater than biological
noise, the latter defined by the average signal produced by
the E. coli control genes.
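
The acceptance rule can be sketched as a simple filter. The
example below is illustrative only: the probe identifiers,
the signal values and the function name are hypothetical, and
"at least three times greater" is read here as signal greater
than or equal to three times the mean control signal.

```python
from statistics import mean
from typing import Dict, List


def accepted_probes(signals: Dict[str, float],
                    control_signals: List[float],
                    fold: float = 3.0) -> Dict[str, float]:
    """Keep probes whose signal is at least `fold` times the biological noise.

    Biological noise is taken as the average signal of the E. coli
    control spots on the same slide.
    """
    noise = mean(control_signals)
    return {probe: s for probe, s in signals.items() if s >= fold * noise}


if __name__ == "__main__":
    controls = [10, 30, 20]                         # hypothetical E. coli spot signals
    probes = {"probe_a": 150, "probe_b": 40, "probe_c": 60}
    print(accepted_probes(probes, controls))
```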

The relative expression signal for these probes
was then plotted as function of tissue or cell type, and is
presented in FIG. 6.

FIG. 6 shows the distribution of expression
across a panel of ten tissues. The graph shows the number
of sequence-verified products that were either not
expressed ("0"), expressed in one or more but not all
tested tissues ("1" - "9"), or expressed in all tissues
tested ("10").

products), 2353 (51%) were expressed in at least one tissue
or cell type. Of the gene elements showing significant
signal - where expression was scored as "significant" if
the normalized Cy3 signal was greater than 1, representing
signal 5-fold over biological noise (0.2) - 39% (991) were
expressed in all 10 tissues. The next most common class
(15%) consisted of gene elements expressed in only a single
tissue.
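
For illustration only, the tally underlying a distribution of this kind may be sketched as follows. The probe names, tissue names and signal values are hypothetical; the only element taken from the text is the scoring rule that expression is "significant" when the normalized Cy3 signal is greater than 1.

```python
from collections import Counter


def tissues_expressed(signals, cutoff=1.0):
    """Count the tissues in which the normalized Cy3 signal exceeds the cutoff."""
    return sum(1 for s in signals.values() if s > cutoff)


def expression_distribution(probes, cutoff=1.0):
    """Map 'number of tissues with significant signal' to 'number of probes'."""
    return Counter(tissues_expressed(sig, cutoff) for sig in probes.values())


# Hypothetical normalized Cy3 signals for three probes in three tissues.
example = {
    "probe_a": {"heart": 2.5, "brain": 0.3, "liver": 0.1},
    "probe_b": {"heart": 1.4, "brain": 1.8, "liver": 3.0},
    "probe_c": {"heart": 0.2, "brain": 0.2, "liver": 0.4},
}
dist = expression_distribution(example)
```

With ten tissues per probe, the keys of the resulting counter would run from 0 to 10, corresponding to the classes "0" through "10" of the graph.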

T 1 K £Ts rYO^^C 1 -.r v~\ ir- r-\ t~* r> i~\ i' >-n -> r-, -J ^ ^ 4- -!,-,„,., _ 

further analyzed, and the results of the analyses are 
compiled in FIG. 7. 

FIG. 7A is a matrix presenting the expression of
all verified sequences that showed expression greater than
3 in at least one tissue. Each clone is represented by a
column in the matrix. Each of the 10 tissues assayed is
represented by a separate row in the matrix, and relative
expression of a clone in that tissue is indicated at the
respective node by intensity of green shading, with the
intensity legend shown in panel B. The top row of the
matrix ("EST Hit") contains "bioinformatic" rather than
"physical" expression data - that is, presents the results
returned by query of EST, NR and SwissProt databases using
the probe sequence. The legend for "bioinformatic
expression" (i.e., degree of homology returned) is
presented in panel C. Briefly, white is known, black is
novel, with gray depicting nonidentical with significant
homology (white: E values < 1e-100; gray: E values from
1e-05 to 1e-99; black: E values > 1e-05).
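
Purely as an illustrative sketch, the three-way shading rule of the "EST Hit" row may be expressed as follows. The function name is hypothetical. Two details are assumptions not stated in the text: values falling between 1e-100 and 1e-99 are treated as gray, and a probe with no database hit at all (represented here as `None`) is treated as black (novel).

```python
def est_hit_shade(e_value):
    """Return the panel C shade for a best-hit BLAST E value.

    white: E < 1e-100 (known)
    gray:  1e-100 <= E <= 1e-05 (nonidentical, significant homology)
    black: E > 1e-05, or no hit at all (novel)
    """
    if e_value is None:  # assumption: no hit is shown as novel
        return "black"
    if e_value < 1e-100:
        return "white"
    if e_value <= 1e-05:
        return "gray"
    return "black"


shades = [est_hit_shade(e) for e in (0.0, 1e-120, 1e-50, 1e-05, 0.5, None)]
```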

As FIG. 7 readily shows, heart and brain were
demonstrated to have the greatest numbers of genes that
were shown to be uniquely expressed in the respective
tissue. In brain, 200 uniquely expressed genes were
identified; in heart, 150. The remaining tissues gave the
following figures for uniquely expressed genes: liver, 100;
lung, 70; fetal liver, 150; bone marrow, 75; placenta, 100;
HeLa, 50; HBL, 100; and BT474, 50.

It was further observed that there were many more
"novel" genes among those that were up-regulated in only
one tissue, as compared with those that were down-regulated
in only one tissue. In fact, it was found that ORFs whose
expression was measurable in only a single one of the tested
tissues were represented in sequencing databases at a rate
of only 11%, whereas 36% of the ORFs whose expression was
measurable in 9 of the tissues were present in public
databases. As for those ORFs expressed in all ten tissues,
fully 45% were present in existing expressed sequence
databases. These results are not unexpected, since genes
expressed in a greater number of tissues have a higher
likelihood of being, and thus of having been, discovered by
EST approaches.

Comparison of Signal from Known and Unknown Genes

The normalized signal of the genes found to have
high homology to genes present in the GenBank human EST
database was compared to the normalized signal of those
genes not found in the GenBank human EST database. The
data are shown in FIG. 8.

FIG. 8 shows the normalized Cy3 signal intensity
for all sequence-verified products with a BLAST Expect
("E") value of greater than 1e-30 (designated "unknown")
upon query of existing EST, NR and SwissProt databases, and
shows the normalized Cy3 signal intensity for all
sequence-verified products with a BLAST Expect value of
less than 1e-30 ("known"). Note that biological background
noise has an averaged normalized Cy3 signal intensity of
0.2.

As expected, the most highly expressed of the
ORFs were "known" genes. This is not surprising, since
very high signal intensity correlates with very commonly-
expressed genes, which have a higher likelihood of being
found by EST sequencing.

However, a significant point is that a large
number of even the high expressers were "unknown". Since
the genomic approach used to identify genes and to confirm
their expression does not bias exons toward either the 3'
or 5' end of a gene, many of these high expression genes
will not have been detected in an end-sequenced cDNA
library.

The significant point is that presence of the
gene in an EST database is not a prerequisite for
incorporation into a genome-derived microarray, and
further, that arraying such "unknown" exons can help to
assign function to as-yet undiscovered genes.



Verification of Gene Expression 

To ascertain the validity of the approach
described above to identify genes from raw genomic
sequence, expression of two of the probes was assayed using
reverse transcriptase polymerase chain reaction (RT-PCR)
and northern blot analysis.

Two microarray probes were selected on the basis
of exon size, prior sequencing success, and tissue-specific
gene expression patterns as measured by the microarray
experiments. The primers originally used to amplify the
two respective ORFs from genomic DNA were used in RT-PCR
against a panel of tissue-specific cDNAs (Rapid-Scan gene
expression panel 24 human cDNAs) (OriGene Technologies,
Inc., Rockville, MD).

Sequence AL079300_1 was shown by microarray
hybridization to be present in cardiac tissue, and sequence
AL031734_1 was shown by microarray experiment to be present
in placental tissue (data not shown). RT-PCR on these two
sequences confirmed the tissue-specific gene expression, as
evidenced by a correctly sized PCR product from the
respective tissue type cDNAs.

Clearly, all microarray results cannot, and
indeed should not, be confirmed by independent assay
methods, or the high throughput, highly parallel advantages
of microarray hybridization assays will be lost. However,
in addition to the two RT-PCR results presented above, the
observation that 1/3 of the arrayed genes exist in
expression databases provides powerful confirmation of the
power of our methodology (which combines bioinformatic
prediction with expression confirmation using genome-
derived single exon microarrays) to identify novel genes
from raw genomic data.

To verify that the approach further provides
correct characterization of the expression patterns of the
identified genes, a detailed analysis was performed of the
microarrayed sequences that showed high signal in brain.

For this latter analysis, sequences that showed
high (normalized) signal in brain, but which showed very
low (normalized) signal (less than 0.5, determined to be
biological noise) in all other tissues, were further
studied. There were 82 sequences that fit these criteria,
approximately 2% of the arrayed elements. The 10 sequences
showing the highest signal in brain in microarray
hybridizations are detailed in Table 2, along with assigned
function, if known or reasonably predicted.

Table 2

Function of the Most Highly
Expressed Genes Expressed Only in Brain

Microarray Sequence Name | Normalized Signal | Expression Ratio | Homology to EST present in GenBank | Gene Function as described by GenBank

[illegible] | [illegible] | [illegible] | High | S-100 protein, b-chain, Ca2+ binding protein expressed in central nervous system
AP000047-1 | 2.3 | | High | Unknown function
[illegible] | | | [illegible] | Similar to mouse membrane glyco-protein M6, expressed in central nervous system
[illegible] | 1.3 | | High | Similar to amphiphysin, a synaptic vesicle-associated protein [illegible]
L44140-4 | 1.2 | +2.0 | High | Endothelial actin-binding protein found [illegible] filamin
AC004689-9 | [illegible] | | High | PP2A, neuronal / downregulates activated protein kinases
AL031657-1 | 1.2 | +3.0 | High | Unknown function / Contains the ankyrin motif, a common protein sequence motif
AC009266-2 | 1.1 | +3.7 | Low | Low homology to the Synaptotagmin I protein in rat / present at low levels throughout rat brain
[illegible] | 1.0 | +2.7 | Low | Unknown, very poor homology to collagen
AC004689-3 | 1.0 | | High | Protein Phosphatase PP2A, neuronal / downregulates activated protein kinases


Of the ten sequences studied by these latter
confirmatory approaches, eight were previously known. Of
these eight, six had previously been reported to be
important in the central nervous system or brain. The exon
[illegible] encodes an S-100 Ca2+ binding protein, reported in
the literature to be highly and uniquely expressed in the
central nervous system. Heizmann, Neurochem. Res. 9:1097
(1997).

A number of the brain-specific probe sequences
(including AC006548-9, AC009266-2) did not have homology to
any known human cDNAs in GenBank but did show homology to
rat and mouse cDNAs. Sequences AC004689-9 and AC004689-3
were both found to be phosphatases present in neurons
(Millward et al., Trends Biochem. Sci. 24(5):186-191
(1999)). Two microarray sequences, AP000047-1 and
AP000086-1, have unknown function, with AP000036-1 being
absent from GenBank. Functionality can now be narrowed
down to a role in the central nervous system for both of
these genes, showing the power of designing microarrays in
this fashion.

Next, the function of the chip sequences with the
highest (normalized) signal intensity in brain, regardless
of expression in other tissues, was assessed. In this
latter analysis, we found expression of many more common
genes, since the sequences were not limited to those
expressed only in brain. For example, looking at the 20
highest signal intensity spots in brain, 4 were similar to
tubulin (AC00807905; AF146191-2; AC007664-4; AF14191-2), 2
were similar to actin (AL035701-2; AL034402-1), and 6 were
found to be homologous to glyceraldehyde-3-phosphate
dehydrogenase (GAPDH) (AL035604-1; Z86090-1; AC005064-L;
AC006064-K; AC035604-3; AC006064-L). These genes are often
used as controls or housekeeping genes in microarray
experiments of all types.

Other interesting genes highly expressed in brain
included a ferritin heavy chain protein, which is reported in
the literature to be found in brain and liver (Joshi et
al., J. Neurol. Sci. 134 (Suppl):52-56 (1995)), a result
duplicated with the array. Other highly expressed chip
sequences included a translation elongation factor [illegible]
(AC007[illegible]54-4), a DEAD-box homolog (AL023304-4), and a Y-
chromosome RNA-binding motif (Chai et al., Genomics
49(2):283-89 (1998)) (AC007320-1). A low homology analog
(AP00123-1/2) to a gene, DSCR1, thought to be involved in
trisomy 21 (Down's syndrome), showed high expression in
both brain and heart, in agreement with the literature
(Fuentes et al., Hum. Mol. Genet. 4(10):1935-44 (1995)).

As a further validation of the approach, we
selected the BAC AC006064 to be included on the array.
This BAC was known to contain the GAPDH gene, and thus
could be used as a control for the ORF selection process.
The gene finding and exon selection algorithms resulted in
choosing 25 exons from BAC AC006064 for spotting onto the
array, of which four were drawn from the GAPDH gene. Table
3 shows the comparison of the average expression ratio for
the 4 exons from BAC AC006064 compared with the average
expression ratio for 5 different dilutions of a
commercially available GAPDH cDNA (Clontech).



Table 3

Comparison of Expression Ratio, for each
tissue, of GAPDH

Tissue | AC006064 (n = 4) | Control (n = 5)
Bone Marrow | -1.81 ± 0.11 | -1.85 ± 0.08
Brain | -1.41 ± 0.11 | -1.17 ± 0.05
BT474 | 1.85 ± 0.09 | 1.66 ± 0.12
Fetal Liver | -1.62 ± 0.07 | -1.41 ± 0.05
HBL100 | 1.32 ± 0.05 | 2.64 ± 0.12
Heart | 1.16 ± 0.09 | 1.56 ± 0.10
HeLa | 1.11 ± 0.06 | 1.30 ± 0.15
Liver | -1.62 ± 0.22 | -2.07 ±

Each tissue shows excellent agreement between the
experimentally chosen exons and the control, again
demonstrating the validity of the present exon mining
approach. In addition, the data also show the variability
of expression of GAPDH within tissues, calling into
question its classification as a housekeeping gene and
utility as a housekeeping control in microarray
experiments.

EXAMPLE 3 

Representation of Sequence and Expression Data as a 
"Mondrian" 


For each genomic clone processed for microarray
as above-described, a plethora of information was
accumulated, including full clone sequence, probe sequence
within the clone, results of each of the three gene finding
programs, EST information associated with the probe
sequences, and microarray signal and expression for
multiple tissues, challenging our ability to display the
information.

Accordingly, we devised a new tool for visual
display of the sequence with its attendant annotation
which, in deference to its visual similarity to the
paintings of Piet Mondrian, is hereinafter termed a
"Mondrian". FIGS. 3 and 4 present the key to the
information presented on a Mondrian.

FIG. 9 presents a Mondrian of BAC AC003172 (bases
25,000 to 130,000 shown), containing the carbamyl phosphate
synthetase gene (AF154830.1). Purple background within the
region shown as field 81 in FIG. 3 indicates all 37 known
exons for this gene.

[illegible]
27 of the known exons (73%), GENEFINDER successfully
identified 37 of the known exons (100%), while DICTION
identified 7 of the known exons (19%).

Seven of the predicted exons were selected for
physical assay, of which 5 successfully amplified by PCR
and were sequenced. These five exons were all found to be
from the same gene, the carbamyl phosphate synthetase gene
(AF154830.1).

The five exons were arrayed, and gene expression
measured across 10 tissues. As is readily seen in the
Mondrian, the five chip sequences on the array show
identical expression patterns, elegantly demonstrating the
reproducibility of the system.

FIG. 10 is a Mondrian of BAC AL049839. We
selected 12 exons from this BAC, of which 10 successfully
sequenced, which were found to form between 5 and 6 genes.
Interestingly, 4 of the genes on this BAC are protease
inhibitors. Again, these data elegantly show that exons
selected from the same gene show the same expression
patterns, depicted below the red line. From this figure,
it is clear that our ability to find known genes is very
good. A novel gene is also found from 86.6 kb to 88.6 kb,
upon which all the exon finding programs agree. We are
confident we have two exons from a single gene since they
show the same expression patterns and the exons are
proximal to each other. Backgrounds in the following
colors indicate a known gene (top to bottom):
red = kallistatin protease inhibitor (P29622);
purple = plasma serine protease inhibitor (P05154);
turquoise = α1 anti-chymotrypsin (P01011); mauve = 40S
ribosomal protein (P08865). Note that chip sequences 8 and
12 did not sequence verify.

Genome-Derived Single Exon Probes Useful For Measuring
Human Gene Expression

The protocols set forth in Examples 1 and 2,
supra, were applied to additional human genomic sequence as
it became newly available in GenBank to identify unique
exons in the human genome that could be shown to be
expressed at significant levels in heart tissue.

These unique exons are within longer probe
sequences. Each probe was completely sequenced on both
strands prior to its use on a genome-derived single exon
microarray; sequencing confirms the exact chemical
structure of each probe. An added benefit of sequencing is
that it placed us in possession of a set of single base-
incremented fragments of the sequenced nucleic acid,
starting from the sequencing primer 3' OH. (Since the
single exon probes were first obtained by PCR amplification
from genomic DNA, we were of course additionally in
possession of an even larger set of single base incremented
fragments of each of the 9,980 single exon probes, each
fragment corresponding to an extension product from one of
the two amplification primers.)

The structures of the 9,980 unique single exon
probes are clearly presented in the Sequence Listing as SEQ
ID Nos.: 1 - 9,980. The 16 nt 5' primer sequence and 16 nt
3' primer sequence present on the amplicon are not included
in the Sequence Listing. The sequences of the exons
present within each of these probes are presented in the
Sequence Listing as SEQ ID Nos.: 9,981 - 19,771,
respectively. It will be noted that some amplicons have
more than one exon, and that some exons are contained in
more than one amplicon.

As detailed in Example 2, expression was
demonstrated by disposing the amplicons as single exon
probes on nucleic acid microarrays and then performing two-
color fluorescent hybridization analysis; significant
expression is based on a statistical confidence that the
signal is significantly greater than negative biological
control spots. The negative biological control is formed
from spotted DNA sequences from a different species. Here,
32 sequences from E. coli were spotted in duplicate to give
a total of 64 spots.

For each hybridisation (each slide, each colour)
the median value of the signal from all of the spots is
determined. The normalised signal value is the arithmetic
mean of the signal from duplicate spots divided by the
population median.

Control spots are eliminated if there is more
than a five-fold difference between the raw signals of the
duplicate spots.

The median of the signal from the remaining
control spots is calculated and all subsequent calculations
are done with normalised signals.

Control spots having a signal of greater than
median + 2.4 (the value 2.4 is roughly 12 times the
observed standard deviation of control spot populations)
are eliminated. Spots with such high signals are considered
to be "outliers".

The mean and standard deviation of the modified
control spot populations are calculated.

The mean + 3x the standard deviation (mean +
(3*SD)) is used as the signal threshold qualifier for that
particular hybridisation. Thus, individual thresholds are
determined for each channel and each hybridisation.

This means that, assuming that the data is 
distributed normally, there is a 99% confidence that any 
signal exceeding the threshold is significant. 
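
By way of illustration only, the threshold determination set out in the preceding steps may be sketched as follows. This is not the software actually used: the function name, argument names and sample values are hypothetical. The outlier offset is left as a parameter (defaulting to 2.4, as best read from the text above), and the use of the sample rather than the population standard deviation is an assumption, since the text does not specify which was used.

```python
from statistics import mean, median, stdev


def signal_threshold(control_pairs, population_median,
                     max_fold=5.0, outlier_offset=2.4):
    """Return mean + 3*SD of the filtered, normalised control-spot signals.

    control_pairs: (raw_a, raw_b) raw signals of each duplicated control spot.
    population_median: median raw signal of all spots on this slide/channel.
    Requires at least two control spots to survive filtering.
    """
    normalised = []
    for a, b in control_pairs:
        low, high = min(a, b), max(a, b)
        # Drop duplicates whose raw signals differ more than max_fold.
        if low <= 0 or high / low > max_fold:
            continue
        normalised.append(((a + b) / 2.0) / population_median)

    # Drop "outliers" lying more than outlier_offset above the control median.
    cutoff = median(normalised) + outlier_offset
    kept = [s for s in normalised if s <= cutoff]

    # stdev is the sample standard deviation (an assumption, see lead-in).
    return mean(kept) + 3 * stdev(kept)


# Hypothetical control spots for one channel of one hybridisation.
pairs = [(10, 12), (8, 10), (9, 9), (10, 60), (300, 310)]
threshold = signal_threshold(pairs, population_median=100.0)
```

In this hypothetical data set the pair (10, 60) is removed by the five-fold rule and the pair (300, 310) is removed as an outlier, leaving three control spots from which the threshold is computed. A probe would then be scored as significantly expressed in that channel when its normalised signal exceeds `threshold`.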

The probes and their expression data are
presented in Table 4, set forth in Example 5.
Table 4 lists the probes that are
significantly expressed in the human heart and thus
presents the subset of probes that was recognized to be
useful for measuring expression of their cognate genes in
human heart tissue.

The sequence of each of the exon probes
identified by SEQ ID NOS.: 9,981 - 19,771 was individually
used as a BLAST (or, for SWISSPROT, BLASTX) query to
identify the most similar sequence in each of dbEST,
SwissProt (BLASTX), and NR divisions of GenBank. Because
the query sequences are themselves derived from genomic
sequence in GenBank, only nongenomic hits from NR were
scored.

The smallest in value of the BLAST (or BLASTX)
expect ("E") scores for each query sequence across the
three database divisions was used as a measure of the
"expression novelty" of the probe's ORF. Table 4 is sorted
in descending order based on this measure, reported as
"Most Similar (top) Hit BLAST E Value". Those sequences for
which no "Hit E Value" is listed are those exons which were
found to have no similar sequences.

As sorted, Table 4 thus lists its respective
probes (by "AMPLICON SEQ ID NO.:" and additionally by the
SEQ ID NO.: of the exon contained within the probe: "EXON
SEQ ID NO.:") from least similar to sequences known to be
expressed (i.e., highest BLAST E value), at the beginning
of the table, to most similar to sequences known to be
expressed (i.e., lowest BLAST E value), at the bottom of
the table.
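
As a non-limiting sketch, the "expression novelty" measure and the resulting sort order may be expressed as follows. The function names, probe names and E values are hypothetical. Placing probes with no hit in any database ahead of all others is an assumption; the text states only that such probes have no "Hit E Value" listed.

```python
def expression_novelty(hits):
    """Smallest E value across the queried databases, or None if no hit."""
    values = [e for e in hits.values() if e is not None]
    return min(values) if values else None


def sort_by_novelty(probes):
    """Order probe names from least similar (highest E) to most similar."""
    def key(name):
        e = expression_novelty(probes[name])
        # Probes without any hit sort first (assumption), then descending E.
        return (e is not None, -(e or 0.0))
    return sorted(probes, key=key)


# Hypothetical best-hit E values per database for three probes.
probes = {
    "p1": {"dbEST": 1e-50, "NR": 1e-10, "SwissProt": None},
    "p2": {"dbEST": None, "NR": None, "SwissProt": None},
    "p3": {"dbEST": 2.0, "NR": 0.5, "SwissProt": 1e-3},
}
order = sort_by_novelty(probes)
```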

Table 4 further provides, for each listed probe,
the accession number of the database sequence that yielded
the "Most Similar (top) Hit BLAST E Value", along with the
name of the database in which the database sequence is
found ("Top Hit Database Source").

[illegible] for the probe and exon nucleic acid
sequences. These are set out as PEPTIDE SEQ ID NOS.:. The
peptide sequences for a given exon are predicted as
follows: Since each chip exon is a consensus sequence drawn
from predictions from various exon finding programs (i.e.
Grail, GeneFinder and GenScan), the multiple initial ORFs
are first determined in a uniform way according to each
prediction. In particular, the reading frame for predicting
the first amino acid in the peptide sequence always starts
with the first base of any codon and ends with the last
base of a non-termination codon. Next, for each strand of the
exon, initial ORFs are merged into one or more final ORFs
in an exhaustive process based on the following criteria:
1) the merging ORFs must be overlapping, and 2) the merging
ORFs must be in the same frame.
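
The merging step may be sketched, for illustration only, as follows. The representation is an assumption: each initial ORF on one strand is given as a half-open (start, end) interval in exon coordinates, and its frame is taken to be start modulo 3. The function name and sample coordinates are hypothetical, and the sketch is not the authors' implementation.

```python
def merge_orfs(initial_orfs):
    """Merge overlapping, same-frame ORFs on one strand into final ORFs.

    initial_orfs: list of (start, end) half-open intervals; the frame of an
    ORF is assumed to be start % 3. A sorted sweep merges every chain of
    overlapping intervals, so no further merge is possible afterwards.
    """
    final_orfs = []
    for frame in range(3):
        in_frame = sorted(o for o in initial_orfs if o[0] % 3 == frame)
        current = None
        for start, end in in_frame:
            if current is not None and start < current[1]:
                # Criteria 1 and 2 both hold: overlapping and same frame.
                current = (current[0], max(current[1], end))
            else:
                if current is not None:
                    final_orfs.append(current)
                current = (start, end)
        if current is not None:
            final_orfs.append(current)
    return sorted(final_orfs)


# Hypothetical initial ORFs predicted by different programs for one strand.
merged = merge_orfs([(0, 30), (21, 60), (1, 40), (90, 120)])
```

Here the two frame-0 ORFs (0, 30) and (21, 60) overlap and are merged, whereas (1, 40), although overlapping both, lies in a different frame and is kept separate. The same function would be applied independently to each strand of the exon.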

The Sequence Listing, which is a superset of all
of the data presented in Table 4, further includes, for
each probe, the most similar hit, with accession number and
BLAST E value, from each of the three queried
databases.

Table 4 further lists, for each probe, a portion
of the descriptor for the top hit ("Top Hit Descriptor") as
provided in the sequence database. For those ORFs that are
similar in sequence, but nonidentical to known sequences
(e.g., those with BLAST E values between about 1e-05 and
1e-100), the descriptor reveals the likely function of the
protein encoded by the probe's ORF.

Using BLAST E value cutoffs of 1e-05 (i.e., 1 x
10^-5) and 1e-100 (i.e., 1 x 10^-100) as evidence of similarity
to sequences known to be expressed is of course arbitrary:
in Example 2, supra, a BLAST E value of 1e-30 was used as
the boundary when only two classes were to be defined for
analysis (unknown, >1e-30; known, <1e-30) (see also FIG. 8).

Furthermore, even where the "Most Similar (top) Hit BLAST E
Value" is low, e.g., less than about 1e-100 - which is
probative evidence that the query sequence has previously
been shown to be expressed - the top hit is highly unlikely
exactly to match the probe sequence.

First, such expression entries typically will not
have the intronic and/or intergenic sequence present within
the single exon probes listed in the Table. Second, even
the ORF itself is unlikely in such cases to be present
identically in the databases, since most of the EST and
mRNA clones in existing databases include multiple exons,
without any indication of the location of exon boundaries.

As noted, the data presented in Table 4 represent
a proper subset of the data present within the attached
sequence listing. For each amplicon probe (SEQ ID NOs.: 1
- 9,980) and probe exon (SEQ ID NOs.: 9,981 - 19,771,
respectively), the sequence listing further provides,
through iterated annotation fields <220> and <223>:

(a) the accession number of the BAC from which
the sequence was derived ("MAP TO"), thus providing a link
to [illegible] and other information about
the genomic milieu of the probe sequence;

(b) the most similar sequence provided by BLAST
query of the EST database, with accession number and BLAST
E value for the "hit";

(c) the most similar sequence provided by BLAST
query of the GenBank NR database, with accession number and
BLAST E value for the "hit"; and

(d) the most similar sequence provided by BLASTX
query of the SWISSPROT database, with accession number and
BLAST E value for the "hit".



EXAMPLE 5

Genome-Derived Single Exon Probes Useful For Measuring
Expression of Genes in Human Heart

Table 4 (413 pages) presents expression, homology, and
functional information for the genome-derived single exon
probes that are expressed significantly in human heart.

yq84f07.r1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone IMAGE:202501 5'
Homo sapiens matrix metalloproteinase MMP Rasi-1 gene, promoter region
Homo sapiens matrix metalloproteinase MMP Rasi-1 gene, promoter region
Hordeum vulgare receptor-like kinase LRK10 gene, partial cds
Hordeum vulgare receptor-like kinase LRK10 gene, partial cds
IMMUNOGLOBULIN A1 PROTEASE PRECURSOR (IGA1 PROTEASE)
Aquifex aeolicus section 12 of 109 of the complete genome
Homo sapiens chromosome 21 segment HS21C078
CELL SURFACE GLYCOPROTEIN 1 PRECURSOR (OUTER LAYER PROTEIN D) (S-LAYER PROTEIN 1)
Human PFKL gene for liver-type 6-phosphofructokinase (EC 2.7.1.11) exon 2
CM0-NN0001-10030D-274-e11 NN0001 Homo sapiens cDNA
FGF-1=fibroblast growth factor 1 [human, kidney, Genomic, 342 nt, segment 2 of 2]
Homo sapiens LGMD2B gene
H.sapiens DMA, DMB, HLA-Z1, IPP2, LMP2, TAP1, LMP7, TAP2, DOB, DQB2 and RING8, 9, 13 and 14 genes
nw21g02.s1 NCI_CGAP_GCB0 Homo sapiens cDNA clone IMAGE:1241138 3' similar to contains THR.t3 THR repetitive element ;
602038009F1 NCI_CGAP_Brn64 Homo sapiens cDNA clone IMAGE:4185866 5'
7i45e10.x1 Soares_NSF_F8_9W_OT_PA_P_S1 Homo sapiens cDNA clone IMAGE:3524443 3' similar to contains MER29.b2 MER29 repetitive element ;
AV715377 DCB Homo sapiens cDNA clone DCBAIE03 5'
Homo sapiens Xq pseudoautosomal region; segment 1/2
aj24c01.s1 Soares_testis_NHT Homo sapiens cDNA clone 1391232 3' similar to contains MER19.H MER1 repetitive element ;
aj24c01.s1 Soares_testis_NHT Homo sapiens cDNA clone 1391232 3' similar to contains MER19.M MER1 repetitive element ;
RC4-CT0322-080100-013-d09 CT0322 Homo sapiens cDNA
Homo sapiens TFF gene cluster for trefoil factor, complete cds
aj24c01.s1 Soares_testis_NHT Homo sapiens cDNA clone 1391232 3' similar to contains MER19.H MER1 repetitive element ;
Human DNA, SINE repetitive element
Saguinus oedipus gene for seminal vesicle secreted protein semenogelin I
hz71c09.x1 NCI_CGAP_Lu24 Homo sapiens cDNA clone IMAGE:3213424 3'
yi72e03.r1 Soares placenta Nb2HP Homo sapiens cDNA clone IMAGE:144796 3'
H.sapiens DNA for endogenous retroviral like element

CLAIMS

1. A spatially-addressable set of single exon nucleic -acid 
probes for measuring gene expression in a sample derived 
5 from human heart comprising a plurality single exon nucleic 
probes, said probes comprising any one of the nucleotide 
sequences set out in SEQ ID NOs : 1 - 3, 98 0 or a 
complementary sequence, or a portion of such a sequence. 

10 2. A spatially-addressable set of single exon nucleic acid 
. probes as claimed in claim 1 wherein each of said plurality 
of probes is separately and addressably amplifiable. 

3. A spatially-addressable set of single exon nucleic acid 
15 probes as claimed in claim 1 wherein each of said plurality 

of probes is separately and addressably isolatable from 
said plurality. 

4. A spatially-addressable set of single exon nucleic acid 
20 probes as claimed in any of claims 1 to 3 wherein said 

probes comprise any one of the nucleotide sequences set out 
in SEQ ID NOS.: 9,981 - 19,771. 

5. A spatially-addressable set of single exon nucleic acid 
25 probes as claimed in any of claims 1 to 4, wherein each of 

said plurality of probes is amplifiable using at least one 
common primer. 

6. A spatially-addressable set of single exon nucleic acid 
30 probes as claimed in any of claims 1 to 5 wherein the set 

comprises between 50 - 20,000 single exon nucleic acid 
probes . 

7. A spatially-addressable set of single exon nucleic acid 
35 probes as claimed in any of claims 1 to 6, wherein the 
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B. A spatially-addressable set of single exon nucleic acid 
5 probes as claimed in any of claims 1 to 7, wherein at lease 
5 0% of said single exon nucleic acid probes lack 
prckaryotic and bacteriophage vector sequence. 

S. A spatially-addressable set of single exon nucleic acid 
10 probes as claimed in any of claims 1 to 8, wherein at least 
50% of said single exon nucleic acid probes lack 
homopolymeric stretches of A or T. 



10. A spatially-addressable set of single exon nucleic acid 
15 probes as claimed in any of claims 1-9 characterised in 
that said set of probes is addressably disposed upon a 
substrate . 



11. A spatially-addressable set of single exon nucleic acid 
20 probes as claimed in claim 10 wherein said substrate is 

selected from class, amorphous silicon, crystalline silicon 
and plastic . - - 

12, A microarray comprising a spatially addressable set of 
25 single exon nucleic acid probes as claimed in any of claims 

1 - 11. 



13. A single exon nucleic acid probe for measuring human 
gene expression in a sample derived from human heart 
30 comprising a nucleotide sequence as set out in any of SEQ 
ID NOs.: 1 - 9,980 or a complementary sequence or a 
fragment thereof wherein said probe hybridizes at high 
stringency to a nucleic acid molecule expressed in the 
human heart. 
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1 - - L - single exon nucleic acid probe as claimed, in claim 13 

ID NOs.: 9,981 - 19,771 or a complementary sequence or a 
1 ragment thereof . 



15. A single exon nucleic acid probe for measuring human 
gene expression in a sample derived from human heart which 
is a nucleic acid molecule having a sequence encoding a 
peptide comprising a peptide sequence as set out in any of 
SEQ ID NOs.: 19,772 - 29,119, or a complementary sequence 
cr a fragment thereof wherein said probe hybridizes at high 
stringency to a nucleic acid expressed in the human heart. 



15 



16. A single exon nucleic acid probe as claimed in any one 
of claims 13 to 15 wherein said single exon nucleic acid 
probe comprises between 15 and 25 contiguous nucleotides of 
said SEQ ID NO. 



17. A single exon nucleic acid probe as claimed in any one
of claims 13 to 15, wherein said probe is between 3 and 25 kb
in length.

18. A single exon nucleic acid probe as claimed in any one 
of claims 13 - 17, wherein said probe is DNA, RNA or PNA. 


19. A single exon nucleic acid probe as claimed in any one 
of claims 13 - 18, wherein said probe is detectably 
labeled.



20. A single exon nucleic acid probe as claimed in any one
of claims 13 - 19, wherein said probe lacks prokaryotic and 
bacteriophage vector sequence. 

21. A single exon nucleic acid probe as claimed in any one
of claims 13 - 20, wherein said probe lacks homopolymeric
stretches of A or T.
22. A method of measuring gene expression [...] human heart, comprising:

[...] collection of detectably labeled nucleic acids,
said first collection of nucleic acids derived
from mRNA of human heart; and then

measuring the label detectably bound to each probe of
said microarray.



23. A method of identifying exons in a eukaryotic genome,
comprising:

algorithmically predicting at least one exon from
genomic sequence of said eukaryote; and then

detecting specific hybridization of detectably labeled
nucleic acids to a single exon probe,
wherein said detectably labeled nucleic acids are derived
from mRNA from the heart of said eukaryote, said probe is a
single exon probe having a fragment identical in sequence
to, or complementary in sequence to, said predicted exon,
said probe is included within a microarray according to
claim 12, and said fragment is selectively hybridizable at
high stringency.

24. A method of assigning exons to a single gene,
comprising:

identifying a plurality of exons from genomic
sequence according to the method of claim 23; and
then

measuring the expression of each of said exons in a
plurality of tissues and/or cell types using
hybridization to single exon microarrays having a
probe with said exon,
wherein a common pattern of expression of said exons in
said plurality of tissues and/or cell types indicates that
the exons should be assigned to a single gene.

25. A nucleic acid sequence as set out in any of SEQ ID
NOs: 1 - 19,771 which encodes a peptide.

26. A peptide encoded by a sequence as set out in any of
SEQ ID NOs: 1 - 19,771.

27. A peptide comprising a sequence as set out in any of
SEQ ID NOs: 19,772 - 29,119.
This International Searching Authority found multiple inventions in this international application, as follows:

see additional sheet 



1. [ ] As all required additional search fees were timely paid by the applicant, this International Search Report covers all
searchable claims.

2. [ ] As all searchable claims could be searched without effort justifying an additional fee, this Authority did not invite payment
of any additional fee.

3. [X] As only some of the required additional search fees were timely paid by the applicant, this International Search Report
covers only those claims for which fees were paid, specifically claims Nos.:

1-27 (partially)

4. [ ] No required additional search fees were timely paid by the applicant. Consequently, this International Search Report is
restricted to the invention first mentioned in the claims; it is covered by claims Nos.:



Remark on Protest 




[ ] The additional search fees were accompanied by the applicant's protest.

[X] No protest accompanied the payment of additional search fees.



Form PCT/ISA/210 (continuation of first sheet (1)) (July 1998)



International Application No. PCT/US 01/00666



FURTHER INFORMATION CONTINUED FROM PCT/ISA/210



Continuation of Box I.2

Claims Nos.: 1-12, 15-21 (partially not searched) 



The following statements about the impossibility of performing a
meaningful search according to Art. 17(2) PCT are made for the subject
matter for which a search has been performed and identified as the first
and second inventions in Form 206 PCT.

Present claims 1-12 and 22-24 relate to an extremely large number of
possible sets of nucleic acid probes comprising Seq. Id. 1 or 2 as well
as microarrays comprising said sets. In fact, the claims contain so many
possible permutations that a lack of clarity and conciseness within the
meaning of Article 6 PCT arises to such an extent as to render a
meaningful search of the claims impossible. Consequently, the search for
the sets of probes comprising Seq. Id. 1 or 2 has been limited to the
Seq. Id. as such.

Claims 1-3, 5, 6, 8-15 and 18-24 relate to portions or fragments of
nucleic acids defined by Seq. Id. 1 or 2. The length or other similar
characterizing features of the portions or fragments are not disclosed,
bringing the total number of possible prior art sequences to
exceptionally high numbers. The shorter the length, the higher the
possibility that an overflow of, in principle unrelated, sequences is
retrieved, making the establishment of a meaningful International Search
Report impossible. For this reason the search has been limited to
portions or fragments of Seq. Id. 1 or 2 having a significant minimum
length and being supported by the description, namely at least 15
contiguous nucleotides (see claim 16).
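
The relationship between fragment length and unrelated hits can be illustrated with a rough calculation. The sketch below is not part of the search report; it assumes a random database with independent, equally frequent bases, under which a given fragment of length k is expected to match by chance at about (N - k + 1) / 4^k positions of an N-nucleotide database. The function name and the database size are hypothetical.

```python
def expected_chance_hits(fragment_length: int, database_size: int) -> float:
    """Expected number of exact chance matches of one fixed fragment.

    Assumption (not from the source): bases are independent and each of
    A, C, G, T occurs with probability 1/4, and only one strand is scanned.
    """
    if fragment_length < 1:
        raise ValueError("fragment_length must be at least 1")
    # Number of start positions at which the fragment could align.
    positions = max(database_size - fragment_length + 1, 0)
    # Probability that a random window equals the fragment is (1/4)**k.
    return positions / 4 ** fragment_length


if __name__ == "__main__":
    DATABASE_SIZE = 10**10  # hypothetical database size in nucleotides
    for k in (8, 12, 15, 25):
        hits = expected_chance_hits(k, DATABASE_SIZE)
        print(f"k={k:2d}: about {hits:.3g} chance matches expected")
```

Under these assumptions the expected number of chance matches falls by a factor of four for every additional nucleotide, which is consistent with limiting the search to fragments of a significant minimum length.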

Claims 15-21 relate to an extremely large number of nucleic acid probes.
The probes are defined solely by their potential to code for peptide Seq.
Id. 19780. However, due to the degeneracy of the genetic code, every
peptide is potentially coded by an extremely high number of nucleic acid
sequences. In fact, the claims contain so many potential nucleic acid
sequences that a lack of clarity and conciseness within the meaning of
Article 6 PCT arises to such an extent as to render a meaningful search
over the whole scope of the claims impossible. The search has therefore
been carried out for those parts of the claims which do appear to be
clear and concise, namely the nucleic acid sequences disclosed in the
application and identified as encoding the referred peptide in table 4
(Seq. Ids. 1 or 2 and 9989).
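
The effect of codon degeneracy can be made concrete with a short calculation. The sketch below is illustrative only and is not part of the search report: it assumes the standard genetic code, and the function name and example peptide are hypothetical. The number of distinct coding sequences for a peptide is the product of the number of synonymous codons for each residue.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code
# (61 sense codons in total; stop codons are not counted).
CODONS_PER_AMINO_ACID = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2,
    "Q": 2, "E": 2, "G": 4, "H": 2, "I": 3,
    "L": 6, "K": 2, "M": 1, "F": 2, "P": 4,
    "S": 6, "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_encoding_sequences(peptide: str) -> int:
    """Count the distinct nucleotide sequences that encode `peptide`.

    Raises KeyError for any character that is not one of the twenty
    standard one-letter amino acid codes.
    """
    return prod(CODONS_PER_AMINO_ACID[residue] for residue in peptide.upper())


if __name__ == "__main__":
    example = "MKTAYIAKQR"  # hypothetical 10-residue peptide, not from the application
    print(example, "can be encoded by", count_encoding_sequences(example), "sequences")
```

Because the count is multiplicative, it grows roughly exponentially with peptide length, which is the difficulty the search report describes for probes defined only by the peptide they encode.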

Likewise, claim 26, which refers to peptides encoded by Seq. Id. 1 or 2
and 9989, encompasses a high and undefined number of possible peptides.
Besides three possible reading frames deriving from the encoding nucleic
acid strand, as well as three additional reading frames deriving from the
complementary nucleic acid strand, every possible fragment of these is
being covered by the claim. This is due to the potential presence of stop
codons within any of the six possible reading frames which can not be
established a priori. Thus, claim 26 contains so many potential peptide
sequences that a lack of clarity and conciseness within the meaning of
Article 6 PCT arises to such an extent as to render a meaningful search
over the whole scope of the claim impossible. Consequently the search
has been carried out for those parts of the claim which do appear to be
clear and concise, namely the peptide disclosed, identified by Seq. Id.
19780.
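
The six reading frames referred to above can be enumerated as in the following sketch. It is illustrative only and not part of the search report: it assumes the standard genetic code and an input containing only the letters A, C, G and T, and the function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G; "*" marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from position 0; trailing bases are ignored."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq: str) -> dict[str, str]:
    """Translate all three frames of each strand (keys '+1'..'+3', '-1'..'-3')."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


def stop_free_segments(seq: str) -> dict[str, list[str]]:
    """Split each frame's translation at stop codons, dropping empty pieces."""
    return {
        name: [piece for piece in protein.split("*") if piece]
        for name, protein in six_frame_translations(seq).items()
    }


if __name__ == "__main__":
    example = "ATGGCCTAAGGA"  # hypothetical sequence, not from the application
    for frame, pieces in stop_free_segments(example).items():
        print(frame, pieces)
```

Each stop-free segment, and every fragment of it, is a candidate peptide, so the number of peptides that a single nucleotide sequence may be said to encode cannot be bounded without fixing the frame and the fragment boundaries.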

The applicant's attention is drawn to the fact that claims, or parts of
claims, relating to inventions in respect of which no international
search report has been established need not be the subject of an
international preliminary examination (Rule 66.1(e) PCT). The applicant
is advised that the EPO policy when acting as an International
Preliminary Examining Authority is normally not to carry out a
preliminary examination on matter which has not been searched. This is
the case irrespective of whether or not the claims are amended following
receipt of the search report or during any Chapter II procedure.
This International Searching Authority found multiple (groups of)
inventions in this international application, as follows:

1. Claims: 1 - 27 (partially) 

Invention number 1: 

A nucleic acid probe comprising SEQ ID 1, complementary 
sequences or fragments thereof (in particular comprising
Seq. Id. 9989). Spatially addressable sets of probes
comprising said sequence, microarrays comprising said sets,
a method for measuring gene expression, a method for 
identifying exons, a method for assigning exons to a single 
gene comprising the use of said arrays and peptide encoded 
by Seq. Id. 1 (in particular the one defined by Seq. Id. 
19780). 



2. Claims: 1 - 27 (partially) 
Invention 2 

A nucleic acid probe comprising SEQ ID 2, complementary 
sequences or fragments thereof (in particular comprising 
Seq. Id. 9989). Spatially addressable sets of probes 
comprising said sequence, microarrays comprising said sets, 
a method for measuring gene expression, a method for 
identifying exons, a method for assigning exons to a single 
gene comprising the use of said arrays and peptide encoded 
by Seq. Id. 2 (in particular the one defined by Seq. Id. 
19780). 



3. Claims: 1 - 27 (partially) 
Inventions 3 - 9980 

A nucleic acid probe comprising SEQ ID n (where n ranges
from 2 - 9980 according to the invention number above),
complementary sequences or fragments thereof, in particular
comprising the SEQ ID no. which is listed in the column
"Exon Seq. Id. no." in the same row that contains Seq. Id. n
in table 4. Spatially addressable sets of probes comprising
said sequence, microarrays comprising said sets, a method
for measuring gene expression, a method for identifying
exons, a method for assigning exons to a single gene
comprising the use of said arrays and peptide encoded by
Seq. Id. n, in particular the one defined by the Seq. Id.
no. in the column "ORF Seq. Id. no." of the same row where
Seq. Id. n is listed.
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